
DATA ANALYSIS AND INTERPRETATION LAB

SPSS TUTORIAL

PART I: DESCRIBING DATA, CONFIDENCE INTERVALS, CORRELATION (WEEK 8)

There is a data set posted on the BlackBoard site that you can use to follow along

FIRST OFF: HOW TO FIND/USE SPSS


On DePaul lab computers, go to PROGRAMS, then STATISTICAL SOFTWARE, and click on SPSS.
Alternatively, you can open a pre-existing data set (like the one used for this lab) and it
should open up SPSS.

In SPSS, you can enter your data and calculate descriptive and inferential statistics using
fairly simple steps.

When entering data, I would advise entering as follows:

Participant ID   Variable 1 (e.g., year)   Variable 2 (e.g., GPA)   Variable 3 (e.g., graduate school plans)
Participant 1    3                         3.2                      0
Participant 2    2                         2.5                      1
Participant 3    4                         2.8                      1

List each participant going down the first column (give them ID numbers—PROTECT CONFIDENTIALITY).
Each of the following columns represents one of your variables. In SPSS, make sure to
CODE all variables into NUMERIC values. SPSS can only compute statistics on
numbers (not on written/text responses). For example, Year (1= Freshman, 2= Sophomore, 3=
Junior, 4= Senior). You can also DICHOTOMIZE variables (assigning them a score of 0
or 1, depending on presence or absence—good for things like gender or the presence/absence
of a behavior [observational research]). For example, Graduate School Plans... 0= not
planning on grad school, 1= planning on going to grad school.

STEP 1: GETTING TO KNOW THE DATA


EXAMINE GENERAL FEATURES OF DATA, LOOK FOR OUTLIERS,
ANOMALIES, IMPOSSIBLE NUMBERS IN YOUR DATA

Go to ANALYZE, then DESCRIPTIVE STATISTICS in the dropdown menu, and then


FREQUENCIES.
Move the variables of interest over into the column on the right. Click the box for DISPLAY
FREQUENCY TABLES. Next hit CHARTS. From here you can click whatever chart
type you are comfortable with. I generally recommend choosing HISTOGRAM, and I
would also click the checkbox for WITH NORMAL CURVE. Then hit OK.
You will see a print out like this:
Frequencies

Statistics
IQ
N Valid 88
Missing 0

(Valid lets you know how many pieces of data were entered for this variable [this is also
your sample size, N]; Missing lets you know how many data points for this variable were
not entered/answered)

IQ
        Frequency   Percent   Valid Percent   Cumulative Percent
Valid 75 1 1.1 1.1 1.1
79 1 1.1 1.1 2.3
81 2 2.3 2.3 4.5
82 3 3.4 3.4 8.0
83 2 2.3 2.3 10.2
84 2 2.3 2.3 12.5
85 3 3.4 3.4 15.9
86 2 2.3 2.3 18.2
88 3 3.4 3.4 21.6
89 2 2.3 2.3 23.9
90 1 1.1 1.1 25.0
91 3 3.4 3.4 28.4
92 2 2.3 2.3 30.7
93 2 2.3 2.3 33.0
94 1 1.1 1.1 34.1
95 6 6.8 6.8 40.9
96 2 2.3 2.3 43.2
97 1 1.1 1.1 44.3
98 2 2.3 2.3 46.6
99 1 1.1 1.1 47.7
100 3 3.4 3.4 51.1
101 2 2.3 2.3 53.4
102 3 3.4 3.4 56.8
103 2 2.3 2.3 59.1
104 1 1.1 1.1 60.2
105 3 3.4 3.4 63.6
106 4 4.5 4.5 68.2
107 3 3.4 3.4 71.6
108 3 3.4 3.4 75.0
109 3 3.4 3.4 78.4
110 1 1.1 1.1 79.5
111 4 4.5 4.5 84.1
112 1 1.1 1.1 85.2
114 1 1.1 1.1 86.4
115 2 2.3 2.3 88.6
118 3 3.4 3.4 92.0
120 2 2.3 2.3 94.3
121 1 1.1 1.1 95.5
127 1 1.1 1.1 96.6
128 1 1.1 1.1 97.7
131 1 1.1 1.1 98.9
137 1 1.1 1.1 100.0
Total 88 100.0 100.0

(this table lets you know how many people scored/responded with a certain score
[Frequency] and the percentage of total participants who answered that same way [Percent])

[Histogram of IQ with normal curve overlaid: Mean = 100.3, Std. Dev = 12.98, N = 88]

(this is a histogram of the data—it gives you a visual view of the distribution/variability of
the data, as well as the central tendency, including how NORMAL the data are—again,
NORMAL data will look somewhat like a bell curve)
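(If you prefer typing commands to clicking through menus, you can open a Syntax window and run something like the lines below. Treat this as a sketch, since the syntax SPSS pastes for you can differ a bit by version; it assumes the variable is named IQ, as in this data set.)

* Frequency table and histogram (with normal curve) for IQ.
FREQUENCIES VARIABLES=IQ
  /HISTOGRAM NORMAL.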

OUTLIERS/ANOMALIES/IMPOSSIBLE NUMBERS
So what happens when we find an outlier or “odd” numbers?

Try doing the above steps with the “var00001” variable.


What would you do with data that looks like this?

STEP 2: SUMMARIZING/DESCRIBING THE DATA


DESCRIBE CENTRAL TENDENCIES AND DISTRIBUTION OF THE DATA

Summarizing the data is similar to the above step (in fact, you can do them both at the
same time!).
Click on ANALYZE, then DESCRIPTIVE STATISTICS, and then FREQUENCIES
again. From here click on STATISTICS. Now you can choose any of the statistics listed
here. I would recommend the ones we talked about in class, such as MEAN, MEDIAN,
MODE, RANGE, STANDARD DEVIATION, STANDARD ERROR OF THE MEAN
(S.E. OF THE MEAN). You will get a similar output with the addition of a table that
looks like this:

Statistics
IQ
N Valid 88
Missing 0
Mean 100.26
Std. Error of Mean 1.384

Std. Deviation 12.985

Range 62
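(The syntax version of the same request is just the FREQUENCIES command from before with a STATISTICS subcommand added; again, this is a sketch that assumes the variable name IQ.)

* Descriptive statistics plus histogram for IQ (Step 2).
FREQUENCIES VARIABLES=IQ
  /STATISTICS=MEAN MEDIAN MODE RANGE STDDEV SEMEAN
  /HISTOGRAM NORMAL.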

STEP 3: CONFIRM WHAT THE DATA REVEAL

CONFIDENCE INTERVALS (CI)

CONFIDENCE INTERVALS FOR A SINGLE MEAN


To estimate confidence intervals for a single mean, you can click on GRAPHS then
ERROR BAR. Next, click on SIMPLE and SUMMARIES OF SEPARATE
VARIABLES. Then move over the variables of interest into the box to the right (I would
recommend doing this one variable at a time). Then from the BARS REPRESENT
dropdown menu select CONFIDENCE INTERVAL FOR THE MEAN. Make sure you
have selected 95% for the level (you can choose others, but 95% is the norm). Click OK.
You will see a box that looks like this:
[Error bar chart: 95% CI for IQ, N = 88]

(the dot in the middle represents the SAMPLE MEAN and the bars above and below
represent the UPPER and LOWER bounds of the CI)
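(The same chart can be requested from a Syntax window with something along these lines. This is a sketch, and the exact GRAPH subcommands pasted by the menus may vary by SPSS version.)

* Error bar chart: 95% CI around the mean of IQ.
GRAPH
  /ERRORBAR(CI 95)=IQ.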

You can also go to ANALYZE, then DESCRIPTIVE STATISTICS, and then EXPLORE.
Then move the variable of interest into the DEPENDENT LIST to the right. Make sure you
have STATISTICS or BOTH checked for DISPLAY. Then hit OK. You will see a print
out like this:

Descriptives
Statistic Std. Error
IQ Mean 100.26 1.384
95% Confidence Interval for Mean   Lower Bound   97.51
                                   Upper Bound   103.01
5% Trimmed Mean 99.78
Median 100.00
Variance 168.609
Std. Deviation 12.985
Minimum 75
Maximum 137
Range 62
Interquartile Range 18.50
Skewness .394 .257
Kurtosis -.163 .508
(this gives us a numerical summary of the CI versus a visual display, as seen above)
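(A rough syntax equivalent of the EXPLORE steps above, again assuming the variable IQ:)

* Descriptives with a 95% CI for the mean of IQ.
EXAMINE VARIABLES=IQ
  /PLOT NONE
  /STATISTICS DESCRIPTIVES
  /CINTERVAL 95.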

CONFIDENCE INTERVAL BETWEEN INDEPENDENT GROUP MEANS


To view the CI for a measure (DV) separately within different groups (IV), take similar
steps as above. Click on GRAPHS then ERROR BAR. Again, click on SIMPLE, but
now click on SUMMARIES FOR GROUPS OF CASES—click DEFINE. Then move
the variable of interest over into the VARIABLE section and whatever you want to group
it by (e.g., year in school, gender, etc.) into the CATEGORY AXIS box. Hit OK. You
will see a box similar to this:

[Error bar chart: 95% CI for IQ shown separately for DROPOUT groups 0 (N = 78) and 1 (N = 10)]

(this shows the MEAN and the 95% CI for the variable [IQ] separately for each of the
groups [DROPOUT])
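(Syntax sketch for the grouped error bar chart; here IQ is grouped by DROPOUT, and the exact GRAPH lines may again vary by version.)

* Error bar chart: 95% CI for IQ within each DROPOUT group.
GRAPH
  /ERRORBAR(CI 95)=IQ BY DROPOUT.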

Similar to the example above (CI OF SINGLE MEAN), you can go to ANALYZE, then
DESCRIPTIVE STATISTICS, and then EXPLORE. Put the variable of interest in the
DEPENDENT LIST and the variable you wish to group it by in the FACTOR LIST. Hit
OK. You should have a print out like this:

Descriptives
DROPOUT Statistic Std. Error
IQ 0 Mean 101.65 1.467
     95% Confidence Interval for Mean   Lower Bound 98.73
Upper Bound 104.58
5% Trimmed Mean 101.25
Median 102.00
Variance 167.918
Std. Deviation 12.958
Minimum 75
Maximum 137
Range 62
Interquartile Range 17.50
Skewness .281 .272
Kurtosis -.180 .538
   1 Mean 89.40 2.130
     95% Confidence Interval for Mean   Lower Bound 84.58
Upper Bound 94.22
5% Trimmed Mean 89.50
Median 89.50
Variance 45.378
Std. Deviation 6.736
Minimum 79
Maximum 98
Range 19
Interquartile Range 13.50
Skewness -.267 .687
Kurtosis -1.377 1.334

(this gives us numeric info on the lower and upper CI for both groups [0 and 1])
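(And the syntax sketch for EXPLORE with a grouping factor, assuming IQ grouped by DROPOUT:)

* Descriptives and 95% CIs for IQ, split by DROPOUT group.
EXAMINE VARIABLES=IQ BY DROPOUT
  /PLOT NONE
  /STATISTICS DESCRIPTIVES
  /CINTERVAL 95.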
PART II: TESTS OF STATISTICAL SIGNIFICANCE AND THE “ANALYSIS
STORY”

STEP 3: CONFIRM WHAT THE DATA REVEAL continued...

Last time we discussed CI’s and how they can be used to support our hypotheses

Now we will discuss Null Hypothesis Significance Testing (NHST)


This is the most common approach for data analysis

Goal of NHST is to determine whether mean differences among groups in an experiment


are greater than differences expected simply because of chance (error variation)

First step-- assume that the groups do not differ. This is called the Null Hypothesis (H0)
Assume the independent variable did not have an effect

Step 2-- Probability theory: estimate likelihood of observed outcome, while assuming H0
is true
This is what we mean by “statistically significant”—statistical significance is different
from scientific significance or practical/clinical significance
(Note: you don’t have to necessarily know the differences between these, but just know
that something can be statistically significant without being practically significant)

So what does “statistically significant” mean?


The outcome has a small likelihood (less than a 5% chance, i.e., p < .05) of occurring if H0 is true

Step 3: Run statistical analyses—is p < .05?

If this is the case, you reject H0 – conclude there is an effect of IV on the DV!
So, the difference between means is larger than what would be expected if error variation
(random chance) alone caused the outcome

What do we conclude when a finding is not statistically significant (p > .05)?


We do not reject the H0 -- we cannot claim a difference or an effect
BUT, we also don’t accept the H0!
We don’t conclude that the IV didn’t produce an effect
We cannot make a conclusion about the effect of an IV
Some factor in experiment may have prevented us from observing an effect of the IV
Most common factor: too few participants!!!!
So, no matter what you find, make an argument as to WHY you could not find support
for your hypothesis-- do not say that your hypothesis is incorrect

Due to the nature of probability testing, it is possible that errors can occur with our
findings!

Types of errors:
Type I error: null hypothesis is rejected when it really is true
We observe statistically significant finding (p < .05)
But in truth, there is no effect of IV
Probability of making Type I error = alpha (α)
Setting level of significance at p < .05 indicates researchers accept probability of Type I
error as 5% for any given experiment

Type II error: null hypothesis is false but it’s not rejected


Claim effect of IV is not statistically significant (p > .05)
But in truth, there is an effect of the IV
Experiment missed the effect

Because of the possibility of Type I and Type II errors, researchers are always tentative
about their claims
Use words such as “findings support the hypothesis” or “consistent with the hypothesis”
Never say the hypothesis was proven!!!

There is important information about statistical power and sensitivity in the chapter that
you should be familiar with. In the interest of getting through the material, I am leaving
you to cover that information on your own.

TESTS OF STATISTICAL SIGNIFICANCE

INDEPENDENT GROUPS/SAMPLES T-TEST


If you’re trying to demonstrate that 2 groups are different on some
measure/variable
Let’s see an example from our previous data set—
Remember our above example where we examined the CI for IQ across 2 groups—HS
dropout versus no-dropout?

Let’s now do an INDEPENDENT SAMPLES T-TEST to determine if this difference is
STATISTICALLY SIGNIFICANT

Go to ANALYZE, then COMPARE MEANS, then INDEPENDENT SAMPLES T-TEST.


Then put the variable of interest in the TEST VARIABLES section and the group variable
in the GROUPING VARIABLE section. Now you have to DEFINE the GROUPING
VARIABLE (enter the numeric values used to code the two groups in the data; here, 0
and 1). Then click on OK. You will get a print out that looks like this:
Group Statistics
     DROPOUT   N    Mean     Std. Deviation   Std. Error Mean
IQ   0         78   101.65   12.958           1.467
     1         10   89.40    6.736            2.130

Independent Samples Test

                                   Levene's Test for
                                   Equality of Variances    t-test for Equality of Means
                                   F       Sig.      t       df       Sig.        Mean         Std. Error   95% CI of the Difference
                                                                      (2-tailed)  Difference   Difference   Lower        Upper
IQ   Equal variances assumed       3.951   .050      2.929   86       .004        12.25        4.183        3.938        20.569
     Equal variances not assumed                     4.737   19.064   .000        12.25        2.587        6.841        17.666

Note the Levene’s Test for Equality of Variances—this tests whether the variances (the
spread of scores within each group) are equal across the 2 groups (we want them to
be equal). If this test is NOT SIGNIFICANT (p > .05—which is what we want), then we
look at the top row—EQUAL VARIANCES ASSUMED. If this test IS SIGNIFICANT
(p < .05) we use the lower row—EQUAL VARIANCES NOT ASSUMED. What would
we do in this case?
Next we look at the t value, and whether it’s significant. In this case, the t value is
significant, so we would reject the null hypothesis—we have support that there are
differences between the groups.
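(Syntax sketch for the same t-test. Note that the values in GROUPS must match how the groups are coded in the data, here 0 and 1.)

* Independent samples t-test: IQ by DROPOUT (coded 0/1).
T-TEST GROUPS=DROPOUT(0 1)
  /VARIABLES=IQ.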

ANALYSIS OF VARIANCE (ANOVA)


Used if we are trying to determine whether 3 or more groups differ from each other on some
variable/measure

I have created a new variable in the data set called ADDLVL. This is a variable that
groups participants into 3 groups based on how many ADHD-related problems the
participants had (1= lower, 2= middle, 3= higher). I want to see if these 3 groups differ
on their IQ.
Go to ANALYZE, then COMPARE MEANS, and then ONE-WAY ANOVA. Put the
variable of interest in the DEPENDENT LIST (in this case, IQ) and the grouping variable
in the FACTOR section. You don’t have to make any other changes yet (although I
would go to OPTIONS and click on DESCRIPTIVES... it can never hurt).
You should see an output something like this:

Descriptives
IQ
        N    Mean     Std. Deviation   Std. Error   95% CI Lower Bound   95% CI Upper Bound   Minimum   Maximum
1.00    24   113.71   10.407           2.124        109.31               118.10               90        137
2.00    51   96.29    10.356           1.450        93.38                99.21                75        120
3.00    13   91.00    6.819            1.891        86.88                95.12                79        102
Total   88   100.26   12.985           1.384        97.51                103.01               75        137

ANOVA
IQ
                  Sum of Squares   df   Mean Square   F        Sig.
Between Groups    6257.442         2    3128.721      31.616   .000
Within Groups     8411.547         85   98.959
Total             14668.989        87

Here, we can see that the ANOVA (F-test) is significant (F= 31.616, p= .000). This tells
us that at least 2 of our groups differ significantly from each other. BUT WHICH ONES?
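(Syntax sketch for the one-way ANOVA above, with the descriptives requested as well; it assumes the variable names IQ and ADDLVL from this data set.)

* One-way ANOVA: IQ across the three ADDLVL groups.
ONEWAY IQ BY ADDLVL
  /STATISTICS DESCRIPTIVES.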

To figure this out, we need to run POST HOC analyses. This basically means “after
this,” or in this case “after this test.” POST HOC analyses should ONLY be run if the
ANOVA/F-TEST was significant at the first step (like above). If it was not, there is no
point in running POST HOCs.
There are many different types of POST HOCs that use different assumptions and meet
certain requirements. For this course, I will not make you figure out the many different
ways in which POST HOCs are different. The most common POST HOC is Tukey’s- so
feel free to use this POST HOC should you need it for your analysis.

You calculate POST HOCs in much the same way as a regular ANOVA. Like before, go to
ANALYZE, then COMPARE MEANS, and ONE-WAY ANOVA. Now click on POST
HOC. From here, click the desired test (Tukey’s in this case). Then hit OK. You should
have a new table that looks like this:

Multiple Comparisons
Dependent Variable: IQ
Tukey HSD
(I) addlvl   (J) addlvl   Mean Difference (I-J)   Std. Error   Sig.   95% CI Lower Bound   95% CI Upper Bound
1.00         2.00          17.41                  2.462        .000    11.54                 23.29
             3.00          22.71                  3.426        .000    14.54                 30.88
2.00         1.00         -17.41                  2.462        .000   -23.29                -11.54
             3.00           5.29                  3.091        .206    -2.08                 12.67
3.00         1.00         -22.71                  3.426        .000   -30.88                -14.54
             2.00          -5.29                  3.091        .206   -12.67                  2.08
* The mean difference is significant at the .05 level.
From here, we can see that GROUP 1 is significantly different from GROUP 2 and
GROUP 3. We can also see GROUP 2 and GROUP 3 are NOT significantly different
from each other. Look at the MEAN DIFFERENCE (I-J) as well as the DESCRIPTIVES
(i.e., MEAN) from before to determine HOW they differ. In this case, GROUP 1 had
higher IQ than GROUP 2 and GROUP 3.
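(To get the post hoc table from syntax, add a POSTHOC subcommand to the same ONEWAY command from before. This is again a sketch of what the menus paste.)

* One-way ANOVA with Tukey post hoc comparisons.
ONEWAY IQ BY ADDLVL
  /STATISTICS DESCRIPTIVES
  /POSTHOC=TUKEY ALPHA(0.05).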
CORRELATION
Next we will run a correlation test. Correlations are used to see whether or not 2 or more
variables are related. So, you can use a correlation to determine the relation between
how a person scores on 2 items/variables/measures. By using this procedure, you can
also use the correlation/relation to predict one variable using another. We’ll talk more
about this in class.

How to run a correlation in SPSS—go to ANALYZE, then CORRELATE, then


BIVARIATE. From here, place the variables whose relationship you want to examine
into the VARIABLES column to the right. For CORRELATION
COEFFICIENTS, click on PEARSON (this gives you your r value—remember this
value from statistics?). For TEST OF SIGNIFICANCE, click on TWO-TAILED (we’ll
talk about the difference between one-tailed and two-tailed later). Make sure FLAG
SIGNIFICANT CORRELATIONS is checked. You can also click on OPTIONS and then
MEANS AND STANDARD DEVIATIONS. Hit OK and you should have a print out
that looks like this:

Correlations
                               IQ        GPA
IQ      Pearson Correlation    1         .497**
        Sig. (2-tailed)        .         .000
        N                      88        88
GPA     Pearson Correlation    .497**    1
        Sig. (2-tailed)        .000      .
        N                      88        88
** Correlation is significant at the 0.01 level (2-tailed).

In a correlation table, what you want to look for is the PEARSON value (r value) and the
SIG row. In the table, you can see that SPSS runs the correlation test for how IQ and
GPA are related, but also for how IQ is related to IQ and how GPA is related to GPA (see
above). When a variable is correlated with itself, the PEARSON value will be 1 (a perfect
correlation). You can ignore those values.

What you’re interested in looking at is how IQ is related to GPA. In this case, the
PEARSON value between these is .497 and the sig (p) is .000. In this case, the
correlation between these 2 variables IS SIGNIFICANT.

The PEARSON/r value tells us the DIRECTION and the STRENGTH of the relationship.
You might remember from Statistics that a correlation can be NEGATIVE or POSITIVE.
A POSITIVE correlation means that both variables move in the SAME DIRECTION—as
one variable INCREASES, the other INCREASES; or as one variable DECREASES,
the other DECREASES. An example of this is as “time spent studying for exam”
INCREASES, “exam score” INCREASES. Alternatively, as “time spent studying for
midterm” DECREASES, “exam score” DECREASES.

If it is NEGATIVE or INVERSE, that means that the variables move in OPPOSITE


DIRECTIONS-- as one variable INCREASES the other DECREASES. One example of
this might be as “stress” INCREASES, school performance might DECREASE.

On the SPSS print out, you can determine whether the correlation is POSITIVE or
NEGATIVE by looking at the PEARSON value—is it a POSITIVE or NEGATIVE
number? In this case, it is POSITIVE, meaning as IQ increases, GPA increases.

The other thing to look for in correlation is the STRENGTH of the relationship. A
correlation score (this is our PEARSON or r value) can range from anywhere between -1
and +1. Here is an example distribution of possible correlation scores:

-1 -.75 -.50 -.25 0 .25 .50 .75 1

The closer the correlation is to 1 or -1, the STRONGER it is—the variables are more
closely related and it will be easier to find a significant relationship. The closer they are
to 0, the WEAKER the relationship is, and it may be harder to find a significant
relationship.
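(Before we add more variables, here is a syntax sketch of the bivariate correlation we just ran. The PRINT line should correspond to the two-tailed test with significant correlations flagged, and the variable names come from our data set.)

* Pearson correlation between IQ and GPA.
CORRELATIONS
  /VARIABLES=IQ GPA
  /PRINT=TWOTAIL NOSIG
  /STATISTICS DESCRIPTIVES
  /MISSING=PAIRWISE.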

You can also run a correlation between 3 or more variables. You would follow the above
procedures (ANALYZE, CORRELATE, BIVARIATE), but now you would just add
additional variables. Make sure all the same things are clicked as noted above and hit
OK. You will now get a print out that looks like this:

Correlations
                                 IQ         GPA        ADDSC
IQ       Pearson Correlation     1          .497**     -.632**
         Sig. (2-tailed)         .          .000       .000
         N                       88         88         88
GPA      Pearson Correlation     .497**     1          -.615**
         Sig. (2-tailed)         .000       .          .000
         N                       88         88         88
ADDSC    Pearson Correlation     -.632**    -.615**    1
         Sig. (2-tailed)         .000       .000       .
         N                       88         88         88
** Correlation is significant at the 0.01 level (2-tailed).

Now you will notice the table has gotten bigger: SPSS tests the correlation between each
pair of variables separately (e.g., IQ—GPA, IQ—ADDSC, GPA—IQ, GPA—ADDSC,
ADDSC—IQ, ADDSC—GPA). You would read the table the same way as above,
focusing on the PEARSON/r value and the Sig level.
What can we determine from adding the new variable into this correlation?
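(In syntax, adding a third variable just means listing it on the VARIABLES line; the rest stays the same. A sketch:)

* Pearson correlations among IQ, GPA, and ADDSC.
CORRELATIONS
  /VARIABLES=IQ GPA ADDSC
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.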

We’ll now talk a little bit about how to COMPUTE and TRANSFORM data and
variables in SPSS. I will not include this info on BlackBoard-- this material will not be
included on the exam, but might be helpful for the final project.
