
The Very Basics of SPSS (ver. 16 and up, Windows)

Statistical Computing Group @ Research Data Services


SAS, University of Pennsylvania
Last modified: 03/16/2009

This online seminar is to help you get started with basic data management and analysis in SPSS.
It is for those people who:

• Are new to statistical data work and want to learn to use SPSS to manage data and
perform common analysis.
• Took some undergraduate (or perhaps some graduate) introductory statistics course in
SPSS long ago and want to refresh their memory.

SPSS is one of the most user-friendly commercial statistical packages. Even beginners in
statistical analysis will find its point-and-click, dialogue-box interface
approachable and easy to use. In the long run, however, you will benefit a lot more from learning
to drive SPSS through SPSS Syntax. The pros of the syntax approach are:

• It is an efficient way to document your work and make it reproducible (this is the reason I would
strongly discourage you from continuing to rely on the point-and-click approach).
• It is much quicker and more efficient once you learn how to write and run syntax commands.
• It can perform things that are unavailable or inaccessible from the point-and-click menus.

So, this online seminar attempts to prepare you for writing syntax commands yourself in the
future to perform simple data tasks and run basic procedures.

Aside from this online seminar, you have access to a lot of great instructional SPSS resources
online for free. We strongly recommend that you use those resources to the full.

This online seminar assumes that you are using SPSS ver.16 and up. Be aware there are
some significant changes between ver.15 and before and ver.16 on.

Contents:
1. Getting Started: Let’s Open SPSS and Bring in Data
2. How to Get Descriptive Statistics and Graphs
3. How to Define Variable Properties
4. How to Create and Recode Variables
5. How to Subset (Select) Data
6. How to Sort and Split Data
7. Simple Regression Example

1. Getting Started: Let’s Open SPSS and Bring in Data

In this section, we will play with the SPSS windows and get a broad idea of what things look like
in the SPSS environment. In so doing, we also learn how to bring data in SPSS. By the end of
this section we will have a rough but good idea what role each window plays for your data
management and analysis work and how to get SPSS ready for our work.

First, let’s create a working directory for this practice in whatever location you prefer. It is
always a good idea to keep one project directory for one project. Let’s call this new working
directory “verybasicSPSS.” Then download from here the data files we are going to use in this
online workshop. Save them in the working directory you just created.

Now, let’s launch the program.

• Click on the SPSS icon, OR

• Choose SPSS 1x.x for Windows from the Windows Start menu (whatever version you have; this
workshop is using Version 16):

Start > Programs > SPSS for Windows > SPSS 1x.x for Windows

Some of you may have a dialogue box (called “SPSS for Windows 1x.x [your version]”) popping
up asking “What would you like to do?”. In that case, let’s just click on “Cancel” for now and go
see what windows we have in the SPSS interface.

You should be seeing an untitled SPSS Data Editor window now (like below).

The Data Editor window shows you the working (= currently open) dataset in a spreadsheet
format. Of course, it is now new and empty. You see there are two sheets in this window, the
Data View and the Variable View. Currently, the Data View is active (in yellow). You see a
message from SPSS at the bottom of the window. Currently, it is “SPSS Processor is ready” for
your work.

Before getting started with our work, let’s change the output settings. From the menu bar at the
top (of whichever window),

Edit
Options…

This will bring you the “Options” dialogue box. Here, you can control what to display in your
output. Click the “Viewer” tab. Here is one setting I strongly recommend that you choose:
“Display commands in log.” You will see why in a moment. For now, just check the box and
click OK.

Now, let’s bring in some data. We’ll use these files for this practice.

• xls_gss93.xls
• csv_gss93subset.csv
• fix_gss93subset.dat
• GSS93 subset.sav

* The second and third files are subsets of the last one “GSS93 subset” data (7 variables, 97 observations) in
different formats for our practice purpose.

Reading Data from Excel Files

We will start with importing the Excel file “xls_gss93.xls” into SPSS. First, open the Excel
file and understand how it is formatted. The first row has variable names, and the data part is
from the second row and below. Close the Excel file and let’s start reading this file into SPSS.

Start SPSS by clicking on the SPSS icon or from the Windows Start menu. From the SPSS
menu bar at the top, go:

File
Open
Data…

This brings up a dialogue box “Open Data” as shown below. Go to your working directory
“verybasicSPSS,” then select “Excel (*.xls)” format from “Files of type.” Then select the excel
file “xls_gss93.xls” you saved there. Then click Open. (see below for a visualized instruction).

(1) Go to your working directory where you saved the excel file.
(2) “Files of type” is Excel. This brings up our excel file in the above window.
(3) Select this file.
(4) Click Open.

Now you should be seeing another dialogue box “Opening Excel Data Source.”

As we first checked, the excel file has variable names in the first row. So check the “Read
variable names from the first row of the data” box. Click OK. Now you have a new, unsaved
dataset in another SPSS Data Editor window. To save the data in the SPSS format, go from the
pull-down menu:

File
Save

Let’s save it in your working directory with the name “xls_gss93.” Let’s keep this data for a
moment.

Reading Data from Text Files (comma-separated-values)

Next, let’s try importing “csv_gss93subset.csv,” a text file in the comma-separated values
format. The first line contains variable names. From the menu bar at the top,

File
Open
Data…

This again should bring up a dialogue box called “Open Data.” Make sure you are looking in
your working directory “verybasicSPSS.” Since our file extension is .csv, we need to select “All
Files(*.*)” from “Files of type.” Then “csv_gss93subset.csv” shows up in the window. Select it
and click Open.

Then the “Text Import Wizard” dialogue box shows up, which has six steps. Click Next, and in
Step 2 of 6, select the “Yes” radio button for the question “Are variable names included at the
top of your file?” because we do have variable names in the first row. Otherwise accept the default
settings and keep moving on by clicking Next. Then, in Step 6 of 6, click Finish.

You should see another SPSS Data Editor window [DataSet2] popping up. As you can see, multiple
data files can be open simultaneously in SPSS (we’ll say a bit more about this later).
Browse the one you just read in, and let’s just close it without saving.

Reading Data from Text Files (ASCII fixed format)

Finally, let’s practice reading the data “fix_gss93subset.dat,” which is an ASCII fixed format file.
This type of data always comes with a codebook that specifies which column corresponds to
which variable. Take a look at this text file (left below; notice there is no variable name header)
and its codebook (right below).

Text file (fix_gss93subset.dat):

   11320 3143
   215 0 2044
   31325 2043
   425 0 4045
   555 0 1078
   65125 2283
   71122 2255
   85124 3275
   91322 1231
  1025 0 1054

Codebook:

Variable Name    Column Number
id               1-4
wrkstat          5
marital          6
agewed           7-8
sibs             9-10
childs           11
age              12-13

Now, unlike the previous examples, there is no easy point-and-click method to read this type of
data. What do we do then? The best way is to write syntax commands ourselves to bring in this
data. Let’s open a new syntax file for this work. From the menu bar at the top, go:

File
New
Syntax

You now should be seeing the SPSS Syntax Editor, which is another important window in the
SPSS environment (as I emphasized in the introduction, you should eventually learn to use and
write the Syntax file for your work. You are now getting a little glimpse of it…). Let’s save it as
“verybasicspss” in your working directory. Now, let’s type the following commands (be sure to
specify the file location where you saved the file “fix_gss93subset.dat”).
data list fixed file='[specify your working directory]\fix_gss93subset.dat'
/ id 1-4 wrkstat 5 marital 6 agewed 7-8 sibs 9-10 childs 11 age 12-13.

Always end your command with a period.

The command DATA LIST reads a text-format data file by assigning names and formats to
each variable in the file. The keyword FIXED tells SPSS that our data is in fixed format
(actually, this is the SPSS default, so you can skip it). The subcommand FILE = 'file location/name
here' specifies your fixed format file and its location. After the slash (/) we provide SPSS with
variable definitions (the variable names and column numbers) from the codebook.
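
A small convenience, offered as a sketch rather than as part of the original exercise: the CD command changes SPSS's working directory, so that later file specifications can be given relative to it. The path C:\verybasicSPSS below is only a hypothetical stand-in for wherever you created your working directory.

```spss
* Hypothetical location; replace it with your own working directory.
cd 'C:\verybasicSPSS'.

* With the working directory set, the file name alone is enough.
data list fixed file='fix_gss93subset.dat'
 / id 1-4 wrkstat 5 marital 6 agewed 7-8 sibs 9-10 childs 11 age 12-13.
```

Whether you use CD or spell out the full path each time is a matter of taste; the DATA LIST command itself works the same either way.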

Two syntax rules you must remember here:

1. Notice that the whole command ended with a period (“.”). In SPSS, each command
must be completed with a period.
2. SPSS Syntax is NOT case-sensitive.

Now, let’s execute our commands. First, highlight them, then to run the highlighted part, hit the
Run Current button or alternatively hit Ctrl + R keys. When you run the above command,
another Data Editor window should open for this new data. But what did you get there?

You should be seeing a blank spreadsheet under the “*Untitled4 []” heading, although the
Variable View shows that SPSS does have the variable information. Why aren’t we seeing the
data itself?

To see the data, we need to run another command that actually uses this data (because to use this
data, SPSS needs to read it!). Get back to your Syntax Editor, and first make sure we are working
on this data set.

(The pull-down menu in the Syntax Editor indicates your active data source.)

When you have multiple data files open at the same time, you need to tell SPSS which data file
you are working on (which is called “Active” file). You can make your file active by simply
clicking anywhere in the Data Editor window of the data you want to use (in this case,
“*Untitled4 []”), or when you have your Syntax Editor open, you can use the pull-down menu (in
this case, it should be set to “Unnamed” since “*Untitled4 []” is neither saved nor named).
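
The same thing can be done from syntax. The following is a minimal sketch using the DATASET commands; the name fixdata is just an example I made up, not something SPSS assigns for you.

```spss
* Give the unnamed active dataset a name of your choice.
dataset name fixdata.

* Later, when several datasets are open, switch back to it by name.
dataset activate fixdata.

* When you are done with it, you can close it by name as well.
dataset close fixdata.
```

Naming a dataset right after you read it in makes it easier to tell the open Data Editor windows apart, and it leaves a record in your syntax of which data each command was meant for.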

Once you make sure “*Untitled4 []” is active, type in the following command (don’t forget the
period), highlight and run it.

list.

Now, what do you have in your Data Editor and Output Viewer? You should now be seeing the
data content in the Editor, and the command LIST is executed and the result is in the Output
Viewer.

The point is this: SPSS just keeps the data definition in its memory and does not read the data until it needs to,
because that’s efficient in terms of processing. In this example, SPSS encounters the procedural
command LIST, realizes it needs the data “*Untitled4 []” to execute LIST and produce results on
that data, and only at that moment does it read in the data.

But suppose you want to explicitly force a data pass so that you can immediately see the read-in
data in the Data Editor. The command EXECUTE does that for you. If you run the following,

data list fixed file='[specify your working directory]\fix_gss93subset.dat'
/ id 1-4 wrkstat 5 marital 6 agewed 7-8 sibs 9-10 childs 11 age 12-13.

execute.

… then you would immediately see the result in your Data Editor without running any
procedural command. EXECUTE forces all the data to pass (including the data transformation,
where you for example create or recode variables and need to read the new data with those new
variables so you can use them), but it does nothing else to the session. It just forces a data pass.

But as I said, SPSS reads the data whenever it needs to, so EXECUTE is rarely
necessary. In fact, putting EXECUTE after every single data transformation command slows
down the processing because SPSS is forced to read the data at every single EXECUTE, even
when data reading is unnecessary at that moment. So you should use EXECUTE sparingly. We
will come back to this command later and discuss a couple of situations where you absolutely must
run EXECUTE.
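
To make the point concrete, here is a small sketch. The variable agesq is a made-up example (age squared) that is not part of the seminar data; it only illustrates that a transformation followed by a procedure needs no EXECUTE.

```spss
* A transformation followed by a procedure needs no EXECUTE in between.
compute agesq = age ** 2.
descriptives variables=agesq.
```

COMPUTE stays pending until SPSS meets DESCRIPTIVES, which needs the data and therefore triggers the data pass that also creates agesq.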

Anyway, let’s take a look at our output.

We checked “Display commands in log” in the Options menu, so SPSS displays
the syntax it ran on the output.

You see the data content listed. You can save your output by going from the drop-down menu.

File > Save As…

The file extension for the SPSS output is .spv (Note: SPSS older than version 16 has the
extension .spo for the output files. To open and view .spo files in SPSS version 16 or later, you
need to install SPSS Legacy Viewer. For more information, see the SPSS technical support
website). The left-side pane is SPSS’s outline view of your output. It serves as a table of
contents and allows you to navigate different parts of your output by clicking on small output
icons.
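
Saving the output can also be written as syntax. This is only a sketch: the OUTPUT SAVE command saves the designated Viewer document, and both the folder and the file name below are hypothetical. Check the Command Syntax Reference for your version before relying on it.

```spss
* Hypothetical file location and name; adjust to your own working directory.
output save outfile='C:\verybasicSPSS\verybasicspss_output.spv'.
```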

You also see why I strongly recommended you set your “Options” to “Display commands in
log.” Notice that in the output, you see all the syntax we have run so far printed out, even the commands
you didn’t write yourself. Having the actual commands you ran along with the
corresponding output helps you greatly with documentation. You can always see what commands
and options you used to generate the output that you have, and you can always keep track of
exactly what you did with the data. This is very important.

Further, be aware that SPSS syntax like this is running beneath the point-and-click interface,
even when you simply use the pull-down menus without writing syntax commands yourself and
do not see the actual commands SPSS runs. As mentioned earlier, you should eventually learn to
write commands yourself in the Syntax Editor and run them from there, instead of
pointing and clicking. This is also very important for documentation and reproducibility.

Had you written and run the following syntax commands yourself, you would have gotten the
same results. They were the syntax running beneath your pointing and clicking.

Read an Excel file

get data
/type = xls
/file = '[specify your working directory]\verybasicSPSS\xls_gss93.xls'
/sheet = name "xls_gss93subset"
/readnames = on.
execute.

Read a Comma-Separated-Values file


get data
/type = txt
/file = '[specify your working directory]\verybasicSPSS\csv_gss93subset.csv'
/firstcase = 2
/delimiters = ","
/variables =
id f2.0
wrkstat f1.0
marital f1.0
agewed f2.0
sibs f2.0

childs f1.0
age f2.0.
execute.

The command GET DATA is to read external files into SPSS.
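
The File > Save step we did from the menu has a syntax counterpart as well. As a sketch, SAVE OUTFILE writes the active dataset to an SPSS data file; the path uses the same placeholder as above.

```spss
* Save the active dataset as an SPSS data file (.sav).
save outfile='[specify your working directory]\verybasicSPSS\xls_gss93.sav'.
```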

For further syntax help, you always can go from the menu bar at the top,

Help
Command Syntax Reference

We will learn some additional basics of command writing throughout this workshop.

We can also directly input the data from the Syntax Editor. Type the following lines.

* Read dataline from syntax file .


data list / id 1-3 sex 5 (A) age 7-8 treat 10.
begin data
001 f 43 0
002 f 25 1
003 m 36 0
end data.

We use two commands. One is DATA LIST (we already learned it), and the other is a pair of
BEGIN DATA and END DATA. Again notice each command is finalized by a period at the end.
BEGIN DATA and END DATA are used when data are entered within the command sequence,
and data records are placed in between.

One important thing you need to remember from this example is this part:

sex 5 (A)

By default, SPSS treats variables as numeric. The variable sex here is a character variable (f/m).
By putting (A) after the variable name and the column number, you tell SPSS that this is a
character variable.

Open SPSS files

Now, let’s open an SPSS system file “GSS93 subset.sav.” This is actually the easiest part. From
the menu bar of the SPSS Data Editor window at the top, go:

File
Open
Data…

Find and open “GSS93 subset.sav” by double-clicking on it or choosing it and hitting OK.
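
For the record, opening an SPSS data file from syntax takes a single command, GET FILE. A sketch, using the same path placeholder as the earlier examples:

```spss
* Open an SPSS data file from syntax.
get file='[specify your working directory]\verybasicSPSS\GSS93 subset.sav'.
```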

There you go. Let’s click on the tab of the “Variable View” sheet and see what you have there.

Here’s what you should be seeing now.

Can also get each variable’s information here.

The Data View sheet and the Variable View sheet look very similar, but the latter has
information about the variables in the data set shown in the Data View sheet, including variable
names, data type (Numeric or string, etc), variable and value labels, how the missing values are
coded, etc.
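
If you would rather have this variable information printed in the Output Viewer (which is handy for documentation), the DISPLAY command can list it. A minimal sketch:

```spss
* List variable names, labels, formats, value labels, and missing values.
display dictionary.
```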

Most of those information cells have hidden dialogue boxes or pull-down menus which you can
call up by selecting the cell and then clicking on the gray button that comes up on the right side
of the cell. For example, let’s try activating the dialogue box for the values of the variable
“marital.”

(1) Click on this button.
(2) Then the value labels dialogue box for the variable “marital” shows up.

Now, technically this box allows you to define/modify values of the variable. However, I
brought this up only to warn you in case you happen to find it and want to use it. DON’T USE THIS
BOX for data management purposes. Although it looks easy, using this dialogue box is
dangerous. It makes it extremely difficult to keep track of changes you made to the data, because
it does not leave any record of your action. You should use the Syntax Editor instead. Let’s just
click Cancel to close the “Value Labels” dialogue box.
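
To give you an idea of what the syntax alternative looks like, here is a sketch with the VALUE LABELS command. The codes and labels are the ones listed for wrkstat in the codebook used in section 3; “GSS93 subset.sav” already carries these labels, so treat this as an illustration rather than something you need to run now.

```spss
* Value labels for labor force status, taken from the codebook.
value labels wrkstat
  1 'Working fulltime'
  2 'Working parttime'
  3 'Temp not working'
  4 'Unempl, laid off'
  5 'Retired'
  6 'School'
  7 'Keeping house'
  8 'Other'.
```

Because the command sits in your syntax file, anyone (including you, six months from now) can see exactly how the values were labeled.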

Let’s close all the data sources other than the one we just read in, “GSS93 subset.sav.”

2. How to Get Descriptive Statistics and Graphs

In this section, we will learn how to explore our data by running simple descriptive statistics and
drawing graphs. We do both the point-and-click approach and the syntax approach.

Let’s get started. We still have the “GSS93 subset.sav” data in SPSS (if not, open it). We first
want to have descriptive information of the variable “educ.” From the menu bar at the top, go:

Analyze
Descriptive Statistics…
Descriptives

This brings you a dialogue box named “Descriptives.” You have a list of the variables in the left
pane. Let’s select the variable educ (“Highest Year of School Completed”) by double-clicking on
it, OR by highlighting the variable (you can select multiple variables by clicking on them
while pressing and holding the Ctrl key) and then hitting the arrow button between the two panes.

Then let’s click the Options… button and you have the “Descriptives: Options” dialogue box.
We can check boxes of statistics we want to see. So suppose we want to check the mean, standard
deviation, minimum, maximum, and skewness values of this variable. Click Continue.
You are back to the “Descriptives” box.

(1) Select variables by highlighting them and hitting the arrow button.
(2) Options brings up the “Descriptives: Options” dialogue box.
(3) Click Paste.
(4) Then click Continue after selecting options you want.

Okay, we are ready to get descriptive statistics for this variable. Now, let’s click Paste. What did
you get? You should now have got an SPSS Syntax Editor window like the one below.

SPSS commands are completed with a period.

What we just did is paste the syntax command that SPSS writes to obtain
descriptive statistics. As I emphasized, always be aware that SPSS syntax commands like this are
running beneath the point-and-click interface, even when you simply use those pull-down menus
and click OK. You should learn to write SPSS syntax yourself eventually.

Now, let’s take a look at the pasted command. The SPSS command to get descriptive statistics is
DESCRIPTIVES followed by its subcommand VARIABLES = varname. The most basic
structure of SPSS syntax command language is:

COMMAND <options if any>
/ [SUBCOMMAND <options if any>] .

The slash (/) is to separate subcommands. But this basic form can take slightly different forms
command by command. In DESCRIPTIVES, for example, the subcommand VARIABLES
immediately follows the command DESCRIPTIVES, and before the slash (/).
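
As a sketch of how this structure maps onto a real command, here is a shorter DESCRIPTIVES command written by hand. It is typed in lower case on purpose, to show that SPSS does not care about case.

```spss
* Command, then its first subcommand, then a slash before the next one.
descriptives variables=educ
  /statistics=mean stddev.
```

The line break and the indentation are only for readability; the command ends where the period is.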

It is always a good idea to add comments to your syntax file for documentation purposes. Use
an asterisk (*) or the command COMMENT to start your comment text. Remember, all SPSS
commands must end with a period, and that rule applies to comments as well. It is imperative
to indicate the end of your comment with a period. Let me show you why. Run the following
block of commands. What did you get in your Output Viewer?

* Descriptives for years of education
DESCRIPTIVES VARIABLES=educ
/STATISTICS=MEAN STDDEV MIN MAX SKEWNESS .

You got nothing, except for the log of the syntax you just ran. Why? Because SPSS treats
everything between * (or COMMENT) and the next command-ending period (a period at the end of a line) as your comment. In this case,
“Descriptives for… SKEWNESS.” is all treated as one block of comment, so DESCRIPTIVES …
was not executed as a command (… and you are left dumbfounded to find no computation results
shown in your Output Viewer). So, you must always end your comment with a period.

The flip side of this example is that you can start with * or
COMMENT and keep commenting over multiple lines until you end it with a period. This may be
helpful if you need to add extensive comments to your syntax. So let’s fix our syntax.

* Descriptives for years of education
We can comment over multiple lines: everything between an *
and the closing period is one block of comment
Just don’t forget to end it with a period .
DESCRIPTIVES
VARIABLES=educ
/STATISTICS=MEAN STDDEV /*standard deviation*/ MIN MAX SKEWNESS .

Notice you have your comment over multiple lines. As you can see, alternatively, you can use /*
COMMENT HERE */ as well. In this case, */ instead of a period indicates the end of your
comment. This way comments can be inserted in your command lines.
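
The COMMENT command mentioned above works the same way as the asterisk. A minimal sketch:

```spss
comment Descriptives for years of education.
descriptives variables=educ.
```

Here the period after “education” ends the comment, so DESCRIPTIVES on the next line is read as a command of its own.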

Let’s execute this syntax command, including the comment. Select (highlight) the whole syntax
command and hit the Run Current button at the top of the Syntax Editor or hit Ctrl + R keys.

See your results displayed in the SPSS Output Viewer.

SPSS has in the left pane the output index table. Click any listed index, and SPSS navigates you
to the corresponding result objects in the right pane (feel free to try). I highlight the descriptives
to bring the corresponding output to my view.

You can copy and paste those output items. As an example, try right-clicking on the descriptive
table, selecting “Copy,” and then pasting it on your word processor document.

The average number of years of school completed is 13.04. Surprisingly, there are people with zero
years of education. There seems to be no real concern about skewness.

How different is the mean number of years of school completed between males and females? Let’s compare
their mean values.

Analyze
Compare means
Means…

This will bring up the “Means” dialogue box. Select the respondent’s educ (“Year of school
completed”) variable under the “Dependent list” heading, and respondent’s sex for “Independent
list.” Click Options… and add median and skewness to your statistics, and then click Continue.
Then click Paste. Highlight and run the command.

* Comparing years of education by sex .
MEANS TABLES=educ BY sex
/CELLS MEAN COUNT STDDEV MEDIAN SKEW.

You should be seeing the result that on average highest year of education completed is 13.19 for
male respondents and 12.92 for female respondents. The median value seems close to the mean
value for males, so we would expect the variable is mostly normally distributed. Let’s visualize it.

Graphs
Legacy Dialogs
Histogram…

The “Histogram” box pops up. Select the education variable for the “Variable”, check the
“display normal curve” box and choose “Respondent’s Sex” to panel our histogram by column.

Paste the syntax, highlight it, and run.

* Histogram of years of education by sex .
GRAPH
/HISTOGRAM(NORMAL)=educ
/PANEL COLVAR=sex COLOP=CROSS.

And you get your histogram like below.

The distributions are not bad, but (not surprisingly) the highest year of education is 12 for very many
people, especially among female respondents.

Stem-and-leaf plots and box plots are also often used to check variables’ distributions and extreme
values. Here, we use the command EXAMINE and get the descriptive information all at once.

Analyze
Descriptive Statistics
Explore…

The “Explore” dialogue box shows up. Select “educ” (Highest year of school completed) for
“Dependent List” and the sex variable (“Respondent’s sex”) for the Factor List. Then first click
on Statistics and check the “Descriptives” and the “Percentiles” boxes. Continue.

Then click Plots, check the “Factor levels together for Boxplot” and the “Stem-and-leaf” boxes
under the “Descriptive” heading.

Continue, Paste, and run it.

* Boxplot and Stem and leaf .
EXAMINE VARIABLES=educ BY sex
/PLOT BOXPLOT STEMLEAF
/COMPARE GROUP
/PERCENTILES(5,10,25,50,75,90,95) HAVERAGE
/STATISTICS DESCRIPTIVES
/CINTERVAL 95
/MISSING LISTWISE
/NOTOTAL.

[Descriptives and percentiles output omitted]

Check the legends (“Stem width” and “Each leaf”) to see what the stem and leaf represent in your output.

Highest Year of School Completed Stem-and-Leaf Plot for sex= Male

Frequency Stem & Leaf

11.00 Extremes (=<5.0)
10.00 6 . 00000
13.00 7 . 000000
30.00 8 . 000000000000000
17.00 9 . 00000000
21.00 10 . 0000000000
33.00 11 . 0000000000000000
172.00 12 . 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000
55.00 13 . 000000000000000000000000000
57.00 14 . 0000000000000000000000000000
34.00 15 . 00000000000000000
95.00 16 . 00000000000000000000000000000000000000000000000
25.00 17 . 000000000000
32.00 18 . 0000000000000000
16.00 19 . 00000000
18.00 20 . 000000000

Stem width: 1
Each leaf: 2 case(s)

Highest Year of School Completed Stem-and-Leaf Plot for sex= Female

Frequency Stem & Leaf

32.00 Extremes (=<7.0)
29.00 8 . 0000000000
28.00 9 . 000000000
34.00 10 . 00000000000
48.00 11 . 0000000000000000
273.00 12 . 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
80.00 13 . 000000000000000000000000000
109.00 14 . 000000000000000000000000000000000000
36.00 15 . 000000000000
113.00 16 . 00000000000000000000000000000000000000
21.00 17 . 0000000
39.00 18 . 0000000000000
8.00 19 . 000
7.00 Extremes (>=20)

Stem width: 1
Each leaf: 3 case(s)

And here is your boxplot.

The top of the box represents the 75th percentile, the bottom of the box represents the 25th
percentile, and the line in the middle represents the 50th percentile (= median). There is no
middle line in the box for female cases, though. That is because the 50th and 25th percentiles for
the female sample have the same value (= 12; check the percentile table in your output yourself).
The lines that extend out from the top and bottom of the box are called “whiskers,” which represent
the highest and lowest values that are not outliers or extreme values. “Outliers” are values that
lie between 1.5 and 3 box lengths (one box length = the interquartile range) above the 75th
percentile or below the 25th percentile, and “extreme values” are values that lie more than 3 box
lengths beyond those percentiles. They are represented by circles and asterisks beyond the whiskers,
respectively.

EXAMINE is a very useful data exploration command. As you may have noticed, you can also
get a histogram at the same time (try just adding “histogram” to the /plot
subcommand in the above syntax and running it) and a q-q plot (in the same way, add “npplot” in the above syntax
and run it).
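
Put together, the modified command might look like the following sketch. HISTOGRAM and NPPLOT (the EXAMINE keyword for normal probability plots) are simply added to the /plot subcommand; I left out the percentiles and confidence interval subcommands to keep it short.

```spss
* Same exploration as above, with a histogram and normal q-q plots added.
examine variables=educ by sex
  /plot boxplot stemleaf histogram npplot
  /compare group
  /statistics descriptives
  /missing listwise
  /nototal.
```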

We can get a good idea about our data by exploring data like this. Let’s continue and get a
frequency table for respondents’ work status and marital status. How many people are working
full-time or unemployed? How many are married or divorced?

Analyze
Descriptive Statistics
Frequencies…

After selecting the variables “wrkstat” and “marital,” click the Charts… button. You should get
the “Frequencies: Charts” dialogue box like the one below. Check the “Pie charts” radio button
under the “Chart Type” heading and the “Percentages” button under the “Chart Values” heading.
Click Continue, and then paste the syntax. As you can see, those charts can be obtained through
subcommands available in the FREQUENCIES command.

* Frequencies with pie charts .
FREQUENCIES VARIABLES=wrkstat marital
/PIECHART PERCENT
/ORDER=ANALYSIS.

You should now be seeing frequency tables and nice big pie charts for these two variables.
Approximately half of the respondents are working full-time, and about half are currently married.

Labor Force Status

                            Frequency   Percent   Valid Percent   Cumulative Percent
Valid     Working fulltime        747      49.8            49.8                 49.8
          Working parttime        161      10.7            10.7                 60.5
          Temp not working         32       2.1             2.1                 62.7
          Unempl, laid off         51       3.4             3.4                 66.1
          Retired                 231      15.4            15.4                 81.5
          School                   42       2.8             2.8                 84.3
          Keeping house           200      13.3            13.3                 97.6
          Other                    36       2.4             2.4                100.0
          Total                  1500     100.0           100.0

Marital Status

                            Frequency   Percent   Valid Percent   Cumulative Percent
Valid     married                 795      53.0            53.0                 53.0
          widowed                 165      11.0            11.0                 64.0
          divorced                213      14.2            14.2                 78.3
          separated                40       2.7             2.7                 80.9
          never married           286      19.1            19.1                100.0
          Total                  1499      99.9           100.0
Missing   NA                        1        .1
Total                            1500     100.0

Let’s get a crosstab to see if there may be any relationship between political views and opinions
about life-prolonging measures.

Analyze
Descriptive Statistics
Crosstabs…

Choose the variable letdie1 (“Allow incurable patients to die”) for the rows and polviews
(“Think of Self as Liberal or Conservative”) for the columns, and then click on Statistics… and
check the box “Chi-square” in the “Crosstabs: Statistics” dialogue box. Click Continue.

Then click Cells… button to get to the “Crosstabs: Cell Display” dialogue box. We want to know
the expected value for each cell to compare with the corresponding observed value, so check
both the “Observed” and “Expected” checkboxes under the “Count” heading.

Click Continue. Then once back to the main “Crosstabs” dialogue box, click Paste.

* Crosstab between letdie1 and polviews .

CROSSTABS
/TABLES=letdie1 BY polviews
/FORMAT=AVALUE TABLES
/STATISTICS=CHISQ
/CELLS=COUNT EXPECTED
/COUNT ROUND CELL.

Your Chi-square test result is displayed at the bottom of the Crosstabs output.

Chi-Square Tests

                                 Value   df   Asymp. Sig. (2-sided)
Pearson Chi-Square             33.155a    6   .000
Likelihood Ratio               32.143     6   .000
Linear-by-Linear Association   27.998     1   .000
N of Valid Cases                  929

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 6.94.

As the note at the bottom of the table says, the test condition is met (i.e., the
minimum expected count should not be less than 5), so the test is valid. The result is statistically significant, indicating that
political views are associated with opinions about life-prolonging measures. More politically
liberal people are more open to the idea of allowing incurable patients to die.
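
To see the direction of the association more directly, you may want percentages next to the counts. The following is a sketch, not what the dialogue box pasted: it adds the COLUMN keyword to the /cells subcommand so that each cell also shows its percentage within the political-view column.

```spss
* Same table as before, adding column percentages to each cell.
crosstabs
  /tables=letdie1 by polviews
  /statistics=chisq
  /cells=count expected column.
```

Comparing the column percentages across the polviews categories makes it easier to judge which groups are more open to the idea than the raw counts do.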

Okay, let’s now see how closely years of education and age at first marriage are associated.

Graphs
Legacy Dialogs
Scatter/Dot…

The Scatter/Dot dialogue box comes up. Click on “Simple scatter” and click Define.

You should reach the “Simple Scatterplot” dialogue box. Choose agewed (“Age when first
married”) for the Y-Axis and educ (“Highest year of school completed”) for the X-Axis.

Paste the syntax and run it.

* Scatterplot .
GRAPH
/SCATTERPLOT(BIVAR)=educ WITH agewed
/MISSING=LISTWISE.

And you get the scatterplot (below).

Now, we have 1500 observations in this data set, but there seem to be far fewer dots than that.
This is because multiple observations share the same data points. We want the plot to show how
dense each data point is. To do so, we use the Chart Editor.

Double-click anywhere in the scatterplot output area to invoke the Chart Editor. Then go to:

Options
Bin Element

This brings up the “Properties” dialogue box (below). In the “Binning” tab, select the “Color
Intensity” radio button under the “Count Indicators” heading. Click Apply.

Then you have a scatterplot that shows the number of observations at each data point.

So, from the densest area of the scatterplot above, you can see that many people graduated from
high school and then got married soon afterwards, around the age of 20.
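
The scatterplot gives a visual impression; a correlation coefficient puts a number on it. This is not one of the steps above, just a sketch of one way to do it, assuming both variables are treated as continuous and their missing values are already defined in the data file:

```spss
* Pearson correlation between education and age at first marriage (sketch) .
CORRELATIONS
  /VARIABLES=educ agewed
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.
```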

I urge you to closely review the syntax we used in this subsection. So far, we have had SPSS
write the code for us, but again, eventually you should be able to write syntax yourself.
You can then extend it to perform tasks that the point-and-click approach cannot.

3. How to Define Variable Properties

In this section, we will learn how to do some data management/modification work. In the last
section, we read external files with different formats and saved them as SPSS data files. Let’s
open the file “xls_gss93.sav”.

File
Open
Data…

Click on the Variable View sheet, and you see no variable information other than variable names.
No variable labels, value labels, or missing values are defined. We will first work
to define variable properties such as these.

Define Variable Properties

Here is part of your codebook of this data file (just part of it, for our practice purpose).

Variable Name   Variable Label                   Values                  Missing Values   Width

id              Respondent ID Number             -                       -                4
wrkstat         Labor Force Status               1 “Working fulltime”                     1
                                                 2 “Working parttime”
                                                 3 “Temp not working”
                                                 4 “Unempl, laid off”
                                                 5 “Retired”
                                                 6 “School”
                                                 7 “Keeping house”
                                                 8 “Other”
marital         Marital Status                   1 “Married”             9 “NA”           1
                                                 2 “Widowed”
                                                 3 “Divorced”
                                                 4 “Separated”
                                                 5 “Never married”
agewed          Age When First Married                                   0 “nap”          2
                                                                         99 “na”
sibs            Number of Brothers and Sisters                           98 “dk”          2
                                                                         99 “na”
childs          Number of Children               8 “Eight or more”       9 “NA”           1
age             Age of Respondent                -                       99 “NA”          2

We will add the information above to the data file. We will first use the point-and-click approach
and then look at the syntax behind it. From the menu bar at the top, go to:

Data
Define Variable Properties…

This brings up the “Define Variable Properties” dialogue box. Let’s select all the seven variables.

Click Continue.

You’ll reach the next dialogue box where you can define variable properties. Select variables one
by one and define their properties according to the codebook. The “Changed” checkbox is
automatically checked once you make changes to value label. An example using the variable
wrkstat below…

(1) Highlight and select a variable. The property items will then show up on the right side.
(2) Define the variable label.
(2) Define measurement level, type, width, etc.
(2) Define value labels here.
SPSS automatically checks the “Changed” box for you when you make changes to value labels.

Some variables have missing codes. To define missing values, check the “Missing” checkbox.
For example, the variable marital has the value 9 for missing values. To tell SPSS that 9
represents missing values for this variable, you do the following.

Once you finish defining properties for all the variables, click Paste and see what syntax
commands SPSS wrote. For each variable, several commands are used to define its properties.

• To define Measurement level, use VARIABLE LEVEL varname (LEVEL).
• To define Variable label, use VARIABLE LABELS varname ‘label’.
• To define Format, use FORMATS varname (format).
• To define Variable value labels, use VALUE LABELS varname labels.
• To define Missing values, use MISSING VALUES varname (values).

Remember, each command must end with a period “.”.
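
Put together for a single variable, the commands listed above might look like the following. This is a hand-written sketch for marital based on the codebook, not the exact text SPSS pastes; the NOMINAL level and the F1.0 format are assumptions (a numeric variable of width 1 with no decimals):

```spss
* Define properties for marital by hand, following the codebook (sketch) .
VARIABLE LABELS marital 'Marital Status'.
VALUE LABELS marital
  1 'Married'
  2 'Widowed'
  3 'Divorced'
  4 'Separated'
  5 'Never married'.
MISSING VALUES marital (9).
* The level and format below are assumptions, not taken from the pasted syntax .
VARIABLE LEVEL marital (NOMINAL).
FORMATS marital (F1.0).
```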

Let’s run the commands. Then, go to your SPSS Data Editor’s Variable View. See the results.

Save your data (Ctrl + S, or File > Save, or the Save button on the toolbar).

As I noted before, this is how you should change variable properties, because you can keep a
record of all the work and decisions you made for future reference. Further, you can repeat the
same task later if necessary. Don’t do this by using the hidden dialogue boxes of the Variable
View. That will not allow you to keep any systematic record of your work, and you will most
certainly lose track of your research work if you keep doing that.

4. How to Create and Recode Variables

Let’s open “GSS93 subset.sav.”

Now, suppose you are arguing that the effect of age on income level is curvilinear (it may
attenuate after reaching some age threshold) and you are interested in testing this argument
using SPSS. Suppose also that, since you are not sure about the specific shape, you think you
should try a natural-logged version and a quadratic version of the age variable. So you need to
create new age variables with these two functional forms.

Let’s first get descriptive statistics and get a good idea about the variables before proceeding.
Since we already tried it once, let’s do this by writing a syntax command ourselves this time.
Let’s also get a frequency distribution table. We will try writing a syntax command for this too.
Type in the following, and run it.

* Descriptives for the variable age.


descriptives variables = age
/statistics=mean stddev min max skewness.

frequencies variables=age.

The mean value of the variable age is 46.23, with a standard deviation of 17.42. It ranges from
18 to 89. Since the minimum is 18 (18 years old), there are no zero or negative values (for which
the natural log is undefined), so we can log the variable as is, without adding anything.

We’ll start with the point-and-click approach, and then go over the syntax command. From the
menu bar at the top, go:

Transform
Compute Variable…

Then you get a new dialogue box “Compute Variable” popping up. The box under the “Target
Variable” heading at the upper left corner is where you type in a newly created variable name.
To define and compute the new variable, enter the expression in the box “Numeric Expression.”

Let’s start by creating a logged age variable. We call our new variable lnage, so type in “lnage”
in the “Target Variable” box. This is a numeric variable, so click on the Type & Label… button
right below and make sure it is specified as numeric. Also label this variable “Logged Age.”
Click Continue to return to the “Compute Variable” box.

Then type your expression in the “Numeric Expression” box. We use the function
LN(numexpr), which returns the base-e log of numexpr (i.e., a number or expression). You can
also find functions in the boxes under the “Function group” heading and the “Functions & Special
Variables” heading on the right side. For LN, select “Arithmetic” in the former, and then find and
double-click on LN in the latter. The function is automatically entered in the “Numeric
Expression” box. Plug the variable age into the parentheses.

(1) Enter the new variable name.
(2) Click this button to open the Type & Label box, specify the type, and label the new
variable “Logged age”. Click Continue to go back to this dialogue box.
(3) Type in the expression. You can type LN(age) directly, but if you cannot recall functions
or variable names, you can select them from the two boxes on the lower right side and the
variable list on the left side.

Click Paste and see what commands SPSS writes.

COMPUTE lnage = LN(age) .


EXECUTE .

VARIABLE LABELS lnage 'Logged Age' .

COMPUTE newvar = expression. is a frequently used command to create a new variable. Once
you are used to it, you will probably find it much quicker to write and run this syntax
command than to keep pointing and clicking. It really is. LN(expression) is a function that
returns the natural log of its argument. We have already seen VARIABLE LABELS varname ‘label’.

Now, this process involves a transformation command COMPUTE. This creates new variables
and hence updates your data anew. As I mentioned above already, SPSS won’t perform the data
transformation/reading until it needs to, which conversely means it will when it needs to.
Meanwhile, to explicitly force a data pass, one can run EXECUTE. SPSS by default
automatically adds EXECUTE, when transformation commands are pasted from a dialogue box.

Just to get an idea of how this works, try running your command without EXECUTE first, and look
at your Data Editor. A new column is created for the lnage variable in the Data View, but the new
values are not yet read into SPSS, because there has been no data pass, whether from EXECUTE or
from any procedural command.

Now highlight and run EXECUTE, and see what happens in the Data View. SPSS spits out what
it has been quietly keeping in memory and executes the data transformation, and now you have
the new data read into SPSS with the newly created variable lnage.

Let’s also create a quadratic version of the age variable. Let’s call this new variable sqage. This
time, we just use the Syntax Editor. The square of a can simply be expressed as a*a.

* Squared term of the age variable.


compute sqage = age*age.
variable labels sqage 'Quadratic Age'.

descriptives variables = sqage.

Now, first, highlight and run the first two commands (compute and variable labels) and look at
the Data View. Again, a new column is created for the variable sqage, but no data transformation
has been executed yet. Then, this time, instead of EXECUTE, we run the procedural command
descriptives to obtain descriptive statistics for this new variable. You get the result below,
and if you check the Data View you will see that the new data has been read in.

Descriptive Statistics

                     N      Minimum   Maximum   Mean        Std. Deviation
Quadratic Age        1495   324.00    7921.00   2440.0957   1789.00139
Valid N (listwise)   1495

In this example, there is no EXECUTE, yet SPSS still performed the data transformation because
it needs to read in sqage to execute the procedure DESCRIPTIVES for this variable. The point is,
SPSS waits to make data changes until it absolutely needs to do so. This way, the number of data
readings decreases and thereby SPSS’s processing speeds up.

Let me emphasize this again: in most cases you don’t need to run EXECUTE every time,
because SPSS reads the data when it needs to. An unnecessary EXECUTE makes SPSS read
the data again and again even when it doesn’t have to, and as a result slows down the processing.
So use this command sparingly.
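
As a sketch of what “sparingly” can look like in practice, the two transformations from this section can be queued up together and carried out in a single data pass by the procedure that follows them. This re-creates lnage and sqage with the same values as before; nothing here is new apart from the combined layout:

```spss
* Two transformations, then one procedure that triggers a single data pass (sketch) .
compute lnage = ln(age).
compute sqage = age*age.
variable labels lnage 'Logged Age' /sqage 'Quadratic Age'.
descriptives variables = lnage sqage.
```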

This means that most of the time you can remove the EXECUTE that SPSS by default automatically
generates when you paste transformation commands from a dialogue box. But it is annoying
to have SPSS paste EXECUTE and then delete it yourself again and again. We don’t want SPSS to
be so eager to insert EXECUTE every time. So, let’s make SPSS a little lazy.

Edit
Options…

Then you will see the “Options” dialogue box. Click on the “Data” tab, and check the “Calculate
values before used” radio button under the “Transformation and Merge Options” heading.

Click OK.

Let’s try pasting the same syntax to create the lnage variable from the dialogue box and see what
this option change does for us. Click on the Dialog Recall button on the toolbar (every SPSS
window has the same toolbar), and from the drop-down list, recall the dialogue box we just used
to run COMPUTE.

Pull-down menu
shows up.

You should now be seeing the same dialogue box as before, and our last work is still there.
Let’s click Paste (no worries about the “Change existing variable?” message; we are just pasting
the command into the syntax file), and see what is pasted into your syntax file. Can you see that
SPSS is “lazy” now, i.e., it does not paste EXECUTE this time? This means that SPSS won’t
perform data transformations after every transformation command.

So, you don’t have to force a data pass every single time; let SPSS read updated data when it
needs to. There are, however, some specific situations where you absolutely and explicitly need
to get SPSS to run EXECUTE and force a data pass. Let’s take a look at the following example.

* Must-use EXECUTE example (1)
  Rule 1: Lag functions and EXECUTE .

* First create a mini data set.
data list free / var1.
begin data
1 2 3 4 5
end data.

compute var2 = var1.
list.

* (1)-(a): lag() function w/o intervening EXECUTE.
compute lagvar1 = lag(var1) .
compute var1 = var1*var1 .

* (1)-(b): lag() function w/ intervening EXECUTE.
compute lagvar2 = lag(var2) .
* The intervening EXECUTE below is the difference.
execute.
compute var2 = var2*var2 .

list.

What we did above is first to create a simple data set containing var1 and var2, which are
identical, with five observations whose values are 1,2,3,4,5, and then to lag those variables by
using the function LAG(). The only difference between (1)-(a) and (1)-(b) is whether EXECUTE.
is placed after computing the lag variable. Now, see what you got in your output (or the Data
View). What difference did EXECUTE make to the new lagged variables?

var1 var2 lagvar1 lagvar2

1.00 1.00 . .
4.00 4.00 1.00 1.00
9.00 9.00 4.00 2.00
16.00 16.00 9.00 3.00
25.00 25.00 16.00 4.00

Number of cases read: 5 Number of cases listed: 5

Look at lagvar1 and lagvar2. You might have been assuming you were creating a set of the same
lagged variables, but you got different results.

The key, of course, is the presence or absence of EXECUTE after compute lagvar# = lag(var#).
The difference arises because the function LAG() is calculated after all other transformations
are performed, regardless of command order. So, in example (1)-(a), without an intervening
EXECUTE, the new variable lagvar1 was created from the transformed values of var1 (i.e.,
var1*var1). SPSS executed compute var1 = var1*var1. first, and then, only then, calculated
compute lagvar1 = lag(var1). In contrast, in example (1)-(b), we explicitly placed an
intervening EXECUTE after compute lagvar2 = lag(var2), meaning that we forced SPSS to
transform the data and create lagvar2 at that point, before moving on to the var2 transformation.
Thus, lagvar2 was created from the original values of var2, i.e., 1,2,3,4,5.

So, depending on what you mean to do, you need to explicitly force a data pass when you use
LAG(). This is rule No. 1 about the placement of the EXECUTE command between
transformation commands.
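
When you want the lag of the original values, another way to sidestep the question is not to overwrite the source variable at all. The following is a small sketch under that assumption; it builds its own mini data set, and the names lagvar3 and sqvar2 are made up for this illustration:

```spss
* Lag without overwriting the source variable (sketch) .
* lagvar3 and sqvar2 are hypothetical names used only here .
data list free / var2.
begin data
1 2 3 4 5
end data.

compute lagvar3 = lag(var2).
compute sqvar2 = var2*var2.
list.
```

Because var2 itself is never changed, lagvar3 should simply hold the original values shifted down by one case, with or without an intervening EXECUTE.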

Let’s take a look at another example. Examine the below syntax.

* Must-use EXECUTE example (2)
  Rule 2: System variable $CASENUM, SELECT IF and EXECUTE.

* (2)-(a): $casenum and SELECT IF, w/o intervening EXECUTE.
compute var3 = $casenum.
select if (mod(var3,2) = 0).
descriptives var3.

* (2)-(b): $casenum and SELECT IF, w/ intervening EXECUTE.
* First re-run Must-use EXECUTE example (1) syntax to bring back the data .
compute var3 = $casenum.
* The intervening EXECUTE below is the difference.
execute.
select if (mod(var3,2) = 0).
list variables = var1 var2 var3.

$CASENUM is a system variable that contains the current case sequence number (i.e., 1,2,3,4,5… n).
SELECT IF (expression) is a command that selects cases based on the criteria specified after IF.
MOD(a, b) is a function that returns the remainder when a is divided by a non-zero b. So, in the
above syntax we are telling SPSS to select the cases where var3 is an even number.

We will continue to use the mini data set we created (be sure to have this data active). Now, let’s
highlight and run example (2)-(a). What did you get?

Warnings

No cases were input to this procedure. Either there are none in the working data file or all
of them have been filtered out.

This command is not executed.

And indeed, you don’t have any observation in your Data View.

Why did this happen? It’s a combination of the following two things. First, the value of
$CASENUM keeps changing in a dynamic manner. For example, if you delete the first case with
the value 1, the former second case with the value 2 moves up and becomes the first case
with the value 1. Second, SELECT IF sequentially deletes each unselected case. So, in the
example above, SPSS goes through the following sequence: (1) sees compute var3 =
$casenum. , (2) creates var3 and gives the first case a value of 1, (3) evaluates it against the
selection criterion select if (mod(var3,2) = 0). , (4) decides the first case does not meet it, and
(5) deletes the case. Now SPSS comes back to the top of this loop: the former second case now
has a value of 1 for var3, SPSS sees it, decides it does not meet the selection criterion, and
deletes it. It keeps going through the loop until it reaches the last observation (in this case the
5th observation). Notice that no data reading happens throughout this sequence. It is only then
that SPSS sees the procedural command descriptives var3. and tries to read the data to execute
descriptives. But of course, at this point, all the cases are gone and there is no data left to read
in.

That is not what we wanted to do, of course. What should we do? We need to force SPSS to read
the data after the transformation compute var3 = $casenum. to finalize the data before it starts
selecting cases. Let’s re-create the same mini-data set (because it’s gone!) and then highlight and
run the example (2)-(b).

var1 var2 var3

4.00 4.00 2.00


16.00 16.00 4.00

Number of cases read: 2 Number of cases listed: 2

Yes, this is exactly what we meant to do. We first created var3 (1,2,3,4,5), finalized it, selected
even-number cases (i.e., 2 and 4), and then printed it.
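
If the goal is only to run a procedure on the even-numbered cases, and not to delete the others, filtering is a gentler alternative. The sketch below is an assumption-laden illustration and not part of the example above: it supposes you have re-created the five-case mini data set first, and the variable name even is made up. Since FILTER does not remove cases from the file, the case numbering should not shift the way it does with SELECT IF:

```spss
* Keep every case but restrict procedures to even-numbered ones (sketch) .
* The variable name even is hypothetical .
compute var3 = $casenum.
compute even = (mod(var3,2) = 0).
filter by even.
list variables = var3 even.
filter off.
```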

OK, here’s one last example about EXECUTE in this workshop. Examine the following.

* Must-use EXECUTE example (3)
  Rule 3: Transformation command, MISSING VALUES and EXECUTE.

* First, create a mini data set.
data list list / var1 var2 var3 var4.
begin data
1 0 0 4
2 1 2 9
3 0 5 6
4 2 1 9
5 0 2 5
6 2 6 13
7 5 7 1
8 0 2 2
end data.

list.

* (3)-(a): Transformation followed by MISSING VALUES involving that var,
  w/o intervening EXECUTE.
compute var5 = 0.
if var2 = 0 var5 = 1 .
missing values var2 (0).
list.

* Clear missing values.
missing values var2 ().

* (3)-(b): Transformation followed by MISSING VALUES involving that var,
  w/ intervening EXECUTE.
compute var6 = 0.
if var2 = 0 var6 = 1 .
* Again, the intervening EXECUTE below is the difference.
execute.
missing values var2 (0).
list.

After creating a small data set, we first create a new variable var5 in example (3)-(a). We set
all the observations to 0 first, then replace the 0 with 1 for those cases where var2 has the
value 0, so that var5 becomes a 0/1 dummy variable. There should be four cases coded as 1 in
var5, because there are four 0’s in var2. Then we use the command MISSING VALUES variable
(value) to declare the 0’s of var2 as user-defined missing values.

Now, with the small data, let’s first highlight and run (3)-(a). What did you get?

Your var5 has a value of 0 for all the observations, although the value 0 of var2 is now defined
as missing, as you can see from the Variable View of the SPSS Data Editor. It’s not exactly what
we wanted; var5 should have the value of 1 when var2 is 0. Why does this happen?

This is actually yet another situation where you must use EXECUTE explicitly; be careful when
you have transformation commands followed by MISSING VALUES that works on the same
variables as the transformations, because the command MISSING VALUES changes the
dictionary (i.e., variable info in the Variable View) before the transformations are executed. In
this example, the value 0 of var2 is defined as missing before var5 is created and then modified
on the condition of var2, and hence transformation of var5 where var2 = 0 does not occur.

So what we need to do is to complete the transformation and force a data pass (i.e., finalize the
new dummy variable) before MISSING VALUES defines the value 0 of var2 as missing. That’s where
EXECUTE comes in. Place it before MISSING VALUES so that the data transformation is executed
before the missing value command is run. Keep this mini data set active, and after resetting the
missing value definition for var2 (i.e., the “* Clear missing values.” part of the above syntax),
highlight and run (3)-(b) to create var6, which is 1 when var2 = 0, while still defining the
value 0 of var2 as missing.

Now, did you get something different this time?

var1 var2 var3 var4 var5 var6

1.00 .00 .00 4.00 .00 1.00


2.00 1.00 2.00 9.00 .00 .00
3.00 .00 5.00 6.00 .00 1.00
4.00 2.00 1.00 9.00 .00 .00
5.00 .00 2.00 5.00 .00 1.00
6.00 2.00 6.00 13.00 .00 .00
7.00 5.00 7.00 1.00 .00 .00
8.00 .00 2.00 2.00 .00 1.00

Number of cases read: 8 Number of cases listed: 8

See the difference between var5 and var6? Yes, this is what we wanted!

These three above are oft-encountered situations where you need to explicitly force a data pass
and you should keep them in mind.

Rule 1: Lag functions and EXECUTE .


Rule 2: System variable $CASENUM, SELECT IF and EXECUTE.
Rule 3: Transformation command, MISSING VALUES and EXECUTE.

The other two situations are when you run WRITE or XSAVE, both of which are treated as
transformation commands. Ending your program with WRITE or XSAVE without any procedural
command that forces a data pass leads to an empty data file (because, simply, it is not written or
saved). In such cases, you often need EXECUTE after you run those commands.
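
As a minimal sketch of that last point, an XSAVE at the very end of a program can be followed by EXECUTE so that the file is actually written. The file name below is a placeholder, not one used elsewhere in this seminar:

```spss
* XSAVE is only carried out on a data pass, so force one with EXECUTE (sketch) .
* The file name mini_copy.sav is a placeholder; point it at your own folder .
xsave outfile = 'mini_copy.sav'.
execute.
```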

For more about WRITE or XSAVE, see

Help
Command Syntax Reference

You can now close the mini data file without saving it.

Recoding Variables

OK, let’s next learn how to recode existing variables. We have the variable “Marital status” in
the GSS data. With the data file GSS93 subset.sav active, we start by running descriptive
statistics to get a good idea what the variable looks like.

descriptives variables = marital
 /statistics=mean stddev min max skewness.

frequencies variables=marital.

The results are below.

Descriptive Statistics

                     N           Minimum     Maximum     Mean        Std. Deviation   Skewness
                     Statistic   Statistic   Statistic   Statistic   Statistic        Statistic   Std. Error
Marital Status       1499        1           5           2.24        1.563            .847        .063
Valid N (listwise)   1499

Marital Status

                          Frequency   Percent   Valid Percent   Cumulative Percent
Valid     married         795         53.0      53.0            53.0
          widowed         165         11.0      11.0            64.0
          divorced        213         14.2      14.2            78.3
          separated       40          2.7       2.7             80.9
          never married   286         19.1      19.1            100.0
          Total           1499        99.9      100.0
Missing   NA              1           .1
Total                     1500        100.0

This variable has five categories, coded as 1 to 5. The largest category is “married,” and the next
largest is “never married.” “Separated” is by far the smallest category. Substantively, the middle
three categories may be collapsed to create a new variable with three groups: (1) currently
married, (2) previously married (with the assumption that separation is effectively marital
dissolution), and (3) never married:

1. Married → 1. Currently married
2. Widowed, 3. Divorced, 4. Separated → 2. Previously married
5. Never married → 3. Never married

To perform this recode, let’s go from the pull-down menu bar at the top,

Transform
Recode
Into Different Variables…

We select Into Different Variables… rather than Into Same Variables… because we want to
create a new 3-category variable with the original 5-category variable intact (if we use “Into
Same Variables” the original variable would be overwritten with the new one).

Select the variable marital. Under the “Output Variable” heading in the right side, name our
output (new) variable “marital3cat” and add a label (“Marital Status 3 Category”). Click Change.
The middle pane (“Numeric Variable -> Output Variable”) should now show “marital -->
marital3cat.”

(1) Select “marital” into the “Numeric Variable -> Output Variable” pane.
(2) We recode “marital” into a new, different variable. Decide on a name for your new
variable, then label it. Click Change.
(3) Click this button.

Next, we define this new variable “marital3cat” based on the old variable “marital.” Click Old
and New Values…, and you will get another dialogue box, “Recode into Different Variables: Old
and New Values” (below). We want to keep category 1 (married) as is, collapse
categories 2, 3, 4 of the “old” variable (marital) into a “new” category and call that 2, and change
category 5 of the “old” variable (never married) to a new category 3. Here is how to do it.

(1) Use “Range,” as it’s 2 through 4 that we want to recode into a new category.
(2) We want to lump 2-4 into a new 2.
(3) Click Add.

(4) Recode 1 to 1 and 5 to 3 in the new variable too. Use “Value” instead of “Range.” For
everything else, follow the same steps as above.
(5) Once you are done, click Continue to go back to the previous “Recode into Different
Variables” dialogue box. Then click Paste to paste the SPSS syntax into a Syntax Editor.

Once you are done with the point-and-click recoding work, click Continue to go back to the
“Recode into Different Variables” dialogue box. Click Paste and see what syntax commands
SPSS writes for you.

RECODE marital (1=1) (5=3) (2 thru 4=2) INTO marital3cat.


VARIABLE LABELS marital3cat 'Marital status 3 categories'.

The command RECODE oldvarname (recode specifications) INTO newvarname recodes a
variable into a new one. SPSS adds the command VARIABLE LABELS with the label we specified.
As you can see, the syntax for this task is quite simple compared with all the pointing and
clicking we did. This is why you should learn to write your own syntax!

We don’t have value labels for the three new categories, but we know how to add them with
syntax, so let’s add another command to this syntax file to label the values. We also want the
variable displayed without any decimals, like marital, so we set its format with FORMATS. Also
remember, RECODE is a transformation command (as it entails transformation of data), and SPSS
does not execute it and read the data until it has to. So let’s get a frequency distribution of the
new variable. This both forces a data pass and lets us check the new variable.

Our modified syntax looks like this.

RECODE marital (1=1) (5=3) (2 thru 4=2) INTO marital3cat.
VARIABLE LABELS marital3cat 'Marital status 3 categories'.

VALUE LABELS marital3cat
 1 'Married'
 2 'Was Married'
 3 'Never Married' .
FORMATS marital3cat (F4.0).

frequencies variables = marital3cat.

Highlight and run. Looks like we got it done right.

Marital status 3 categories

                          Frequency   Percent   Valid Percent   Cumulative Percent
Valid     Married         795         53.0      53.0            53.0
          Was Married     418         27.9      27.9            80.9
          Never Married   286         19.1      19.1            100.0
          Total           1499        99.9      100.0
Missing   System          1           .1
Total                     1500        100.0
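
Another quick way to check a recode is to cross-tabulate the old variable against the new one. This is a sketch added for illustration, assuming both marital and marital3cat are in the active data file:

```spss
* Cross-check the recode: each old category should fall into exactly one new category (sketch) .
crosstabs
  /tables = marital by marital3cat
  /cells = count.
```

If the recode worked as intended, each row of marital should have all of its cases in a single column of marital3cat.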

5. How to Subset (Select) Data

Sometimes, you may need only a small portion of the data file. For example, you may know for
sure that your analysis will not use some variables and want to drop them to create a smaller
file. Or your study may focus on female respondents only, and you want to limit your sample to
that group. In this section, we will learn how to subset data (variables or observations).

Subsetting Variables

Suppose we want to limit our “GSS93 subset.sav” data to only those variables we are interested
in for our research project.

Before we drop the variables other than our variables of interest, let’s double-check the data
information to make sure you do not forget any important variables to include.

In your new syntax file, type in and run:

* Get data information.

display dictionary .

SPSS prints the “dictionary” in the SPSS Output Viewer: the same variable information and
value label information that you can get from the Variable View tab.

Suppose that, looking through the variable information, you decide that the variables below are
the ones you will need for your analysis.

id wrkstat marital agewed sibs childs age educ degree padeg madeg sex race relig

Let’s create a file that includes those variables above only. Again, we first do subsetting by the
point-and-click approach, and then see how you can do the same by writing your syntax
command.

Have your SPSS Data Editor active (i.e., bring it to the top). From the menu bar, go

File
Save As…

And you have the “Save Data As” box. Make sure to choose the directory you want to save your
subset data in. Here, we are going to save it in our working directory verybasicSPSS. Next, we
need to decide the new file’s name in the “File name:” box. Let’s call it “GSSvarsub.”

(1) Select the location where you want to save your new subset file.
(2) Name your new file.
(3) Click on Variables…

Now, click Variables… and you will get another dialogue box called “Save Data As: Variables.”
By default, SPSS keeps all the variables (all the variables are marked with an X). We will select
those 14 variables listed above.

You could de-select the variables that you don’t need by clicking on their check boxes. Or, in
this case, I would first click Drop All and then select what I want to keep by clicking on their
check boxes, as the number of variables I want to keep is rather small (the former way you
need to uncheck 54 boxes, the latter way you check only 14, which is more efficient). When you
finish selecting what you need, click Continue.

By default, all the variables are selected. You can de-select those that are unnecessary for you.
In this case I would deselect all first by clicking Drop All, and then select the 14 variables.
Fewer clicks.
Click Continue when you finish selecting variables.

Now you should be seeing the message “Keeping 14 of 68 variables.”

Click Paste and see the syntax commands SPSS spits for this task.

SAVE OUTFILE='D:\[Your working directory here]\verybasicSPSS\GSSvarsub.sav'
 /DROP=birthmo zodiac income91 rincom91 region xnorcsiz size partyid vote92
 polviews cappun gunlaw grass life chldidel pillok sexeduc spanking letdie1
 news tvhours bigband blugrass country blues musicals classicl folk jazz
 opera rap hvymetal attsprts visitart tvshows tvnews tvpbs scitest4 partners
 sexfreq dwelown sei cohort income4 degree2 agecat4 politics region4 married
 classic3 jazz3 rap3 blues3 /COMPRESSED.

Looks messy, but the whole structure is pretty simple.

SAVE OUTFILE = ‘your_new_file_name.sav’.

is what you use to save your SPSS data file. Then to select some variables to create a subset of
the original, we use either one of the subcommands

/DROP = list of variables to drop.


/KEEP = list of variables to keep.

SPSS uses /DROP there, but if you write your own syntax commands, you of course can list the
14 variables by using /KEEP. In this case, that would actually be simpler. /COMPRESSED is
just to save a file in compressed form (this is default; meaning you don’t have to specify this
when you write syntax yourself to save a file).

Again, the pasted syntax looks messy with an array of many variable names, but it doesn’t have
to be messy like that when you write your syntax yourself to subset a file, because consecutive
variables such as A, B, C, D, E can be written A to E.

save outfile ='D:\[Your working directory here]\verybasicSPSS\GSSvarsub.sav'
 /keep= id to age educ to race relig
 /compressed.

Another reason you should learn to write your own syntax!

Let’s highlight and run the syntax, and then check your working directory; your new data file
should be saved there. Let’s open the new file and first see if everything looks okay.
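
If you prefer to do this check in syntax as well, the following sketch opens the new file and lists its contents. It assumes the same placeholder path as the SAVE command above:

```spss
* Open the new subset file and list what it contains (sketch) .
get file = 'D:\[Your working directory here]\verybasicSPSS\GSSvarsub.sav'.
display dictionary.
```

The dictionary listing should show only the 14 variables we kept.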

Now that we have a new file for our research project, let’s leave a brief comment in the data
file to help you stay organized. We can do this by using the command DOCUMENT. With this
data file active, let’s write the following command and run it.

* Document this work in the new subset file.


document Subset of "GSS93 subset" inc the necessary variables for project A.

* Let’s display the document we just created.


display document.

You should get the output below in your Output Viewer. Your data comments are stored with the
date information. This way, you won’t lose track of what each data file in your working
directory is about.

Document

1(a)  document Subset of "GSS93 subset" inc the necessary variables for project A.

a. Entered 11-Mar-2009

To drop the comment, use the command DROP DOCUMENTS.

Now, suppose your study focuses on the African American population and you want to limit your
sample to African American cases only. Suppose also that you want to present the distribution
of education levels among this demographic.

Let’s first get the break-down of the variable “race.”

frequencies variables=race.

Race of Respondent

                  Frequency   Percent   Valid Percent   Cumulative Percent
Valid   white          1257      83.8            83.8                 83.8
        black           168      11.2            11.2                 95.0
        other            75       5.0             5.0                100.0
        Total          1500     100.0           100.0

So, we will select those 168 African American cases.

Data…
Select Cases…

You will see the “Select Cases” dialogue box below. Check the radio button “If condition is
satisfied” and click on the If… button.

[Screenshot: the “Select Cases” dialogue box. (1) Check this radio button, and click on the
If… button right below it.]

Then another dialogue box, “Select Cases: If,” shows up. Select the variable race
(“Respondent’s Race”) from the left pane, and move it to the right pane by clicking the
right-pointing arrow. We want to select the 168 African American cases, which are coded as 2 in
the data, as you can see from the Variable View or your dictionary. So, complete the expression
accordingly; that is, we are selecting cases if race = 2.

Click Continue.

[Screenshot: the “Select Cases: If” dialogue box. (2) Select “Respondent’s race” from the left
pane. “black” is coded as 2, so complete the expression accordingly. (3) Then click Continue.]

Once you are back to the “Select Cases” dialogue box (the first one), click Paste and see the
commands SPSS writes and acts on (below).

USE ALL.
COMPUTE filter_$=(race=2).
VARIABLE LABEL filter_$ 'race=2 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMAT filter_$ (f1.0).
FILTER BY filter_$.

As you can see, SPSS creates a variable “filter_$” based on the race variable, with 0 = “Not
Selected” and 1 = “Selected.” Thus, the African American cases should have this variable coded
as 1 (because the African American cases will be “selected”).
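
If you write the filter yourself, it can be shorter than the pasted version. The lines below are a
minimal sketch, not what SPSS pastes: the variable name blackonly is a hypothetical name chosen
for this example, and the labels are optional. Cases for which the filter variable is 0 or missing
are treated as not selected.

```spss
* Hand-written sketch of a filter on race.
* blackonly is a hypothetical variable name used only for this example.
compute blackonly = (race = 2).
variable labels blackonly 'race = 2 (filter)'.
value labels blackonly 0 'Not selected' 1 'Selected'.
filter by blackonly.

* Any procedure that follows uses only the selected cases.
frequencies variables = race.
```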

Let’s obtain a frequency table based on the variable “race.” Running a procedure makes SPSS
carry out the pending commands, so the filter is applied and we can check the result at the same
time.

frequencies variables = race.

Race of Respondent

                  Frequency   Percent   Valid Percent   Cumulative Percent
Valid   black           168     100.0           100.0                100.0

Yes, we have 168 African American observations in our data now.

Now, bring SPSS Data Editor to the front and see what happened to our original data.

[Screenshot: the Data Editor with the filter on. Cases with race != 2 are crossed out; filter_$ is
the “filter” variable SPSS created to filter observations; SPSS indicates that your selection
filter is on.]

Can you see what is going on here? The observation numbers in the row header for those cases
where race = 1 or 3 are simply crossed out, but SPSS still holds all the information of the
1500 observations. What SPSS does here is just filter out the non-black observations by using
the filter variable (FILTER BY filter_$). Scroll to the right, and you will see the variable
“filter_$” that SPSS created for this selection task. As we noted above, the selected cases
(i.e., African Americans) are coded as 1, and the unselected cases (white and other) are coded
as 0.

Let’s see the distribution of education levels among the African American respondents.

frequencies variables = educ.

See the output. SPSS gives you a frequency distribution table for the black cases only. There are
168 observations, but one case is missing for the education variable.

Now, suppose you want to restore the whole data. We will write and run the commands below
(very simple!)

filter off.
use all.

By this we turn off the filter SPSS used to select cases and tell SPSS to restore the whole file
(use all.). See the Data Editor and see what happened. All the observations are now back in.

What is nice about using a filter to subset observations is, as you just saw, it is temporary. When
you are conducting analysis, you may want to subset observations in many different ways. It is
flexible to create a filter and turn it on and off to select observations. And you can keep the
whole original data intact.
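
As a small sketch of this on-and-off use: the filter variable filter_$ stays in the data file after
FILTER OFF, so, assuming you have not deleted it, you can switch the same filter back on whenever
you need it.

```spss
* Turn the existing filter back on and analyze the selected cases only.
filter by filter_$.
frequencies variables = educ.

* Turn it off again and run the same analysis on all cases.
filter off.
frequencies variables = educ.
```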

We can also subset observations permanently. We already used SELECT IF (expressions). The
difference between SELECT IF (expressions) and FILTER BY… is that the former selects
observations permanently while the latter is temporary.

So if you subset observations on the race variable by using SELECT IF (expressions) and run
frequencies …

select if (race=2).
frequencies variables = educ.

Then in the Output Viewer, you should get exactly the same frequency distribution table.
However, check the Data Editor. How many observations do you have now in the data file?

We now have only 168 observations, all of them African American. SPSS does not cross out the
unselected observations. Instead, it deleted them from the working data file permanently.

This method is good if you want to create and save a subset file which only includes cases that
meet certain conditions (e.g., females only, those with higher education only, etc), but unlike the
filter, you cannot restore the deleted cases unless you go back to the original file, so it may be
inconvenient when you are conducting analysis and frequently select and re-select cases. You
should choose which way to go depending on your purpose.
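
There is also a middle way that this seminar has not used so far: the command TEMPORARY makes
the transformations that follow it (here, SELECT IF) apply to the next procedure only. The
lines below are a sketch that assumes you start again from the full 1500-case file.

```spss
* TEMPORARY limits the SELECT IF to the next procedure only.
temporary.
select if (race = 2).
frequencies variables = educ.

* The full file is available again for any command that follows.
frequencies variables = race.
```

This keeps the syntax as short as SELECT IF without deleting any cases, but the selection has to be
repeated before every procedure that needs it.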

6. How to Sort and Split Data

In this section, we are going over how to sort data or conduct data analysis by splitting data.

Sorting Data

Let’s open the data file “GSSvarsub.sav” if it’s not already open.

Sorting data is simple and easy. Suppose we want to sort this data by sex (male = 1, female = 2).
From the pull-down menu bar at the top, go:

Data
Sort Cases…

And the “Sort Cases” dialogue box appears.

Select the variable sex (“Respondent’s Sex”) and click Paste.

SORT CASES BY sex(A).

Too simple a command, isn’t it? The (A) following the variable name sex means that
observations will be sorted in ascending order. That is the default, so you don’t have to specify it
when writing your own syntax. Let’s highlight and run the command, then list the variable.

list variables = id sex.

As you can see, the data is sorted in ascending order.

If you need to sort observations in descending order, you need to explicitly specify that with (D)
in the syntax (instead of (A)). Try it yourself.

sort cases by sex (d).

You can of course sort by more than one variable. For example, if you want to sort observations
by sex, and then within each sex category sort observations by marital status, simply place the by
variables in that order (see below).

sort cases by sex marital.
list id sex marital.
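
The sort direction can also be set separately for each variable. As a small sketch, the command
below sorts by sex in ascending order and, within each sex, by marital status in descending order;
the /CASES subcommand on LIST just limits the listing to the first 20 cases so the output stays
short.

```spss
* Ascending by sex, then descending by marital status within each sex.
sort cases by sex (a) marital (d).

* List only the first 20 cases to check the result.
list variables = id sex marital /cases = from 1 to 20.
```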

Split Observations

Suppose you want to obtain group-by-group numbers, such as average years of education by sex.
As we already saw, this can be done fairly easily; the command MEANS has the option BY. You
can write and run this simple syntax command yourself to achieve the goal.

means educ by sex.

And you get the following result.

Report

Highest Year of School Completed

Respondent's Sex    Mean      N    Std. Deviation
Male               13.19    639             3.349
Female             12.92    857             2.849
Total              13.04   1496             3.074

But how can we obtain separate analyses when the command you want to use does not have this
BY option? Suppose, for example, you suspect that years of education and the number of
children are correlated differently for each sex. Say, having children often makes people
interrupt or give up on education early, but perhaps females are more adversely affected than
males, and group-by-group correlations may give us some clue about this argument. The problem,
however, is that the command CORRELATIONS does not have a BY option, so it does not let
you obtain this statistic by sex and make a comparison in one step.

In such cases as this, here is what you do. From the pull-down menu, go:

Data
Split File…

The “Split File” dialogue box shows up.

Check the radio button “Compare groups” and select the variable “Respondent’s Sex.” We
already sorted the data, but check the “Sort the file by grouping variables” radio button just in
case. Click Paste.

SORT CASES BY sex.
SPLIT FILE LAYERED BY sex.

LAYERED is the default way that SPSS organizes the output, so when/if you write the syntax
yourself, you don’t have to add this (i.e., just split file by sex. will do). Now we can run
correlations between years of education and the number of children, by sex. Let’s add the
following command.

correlations educ childs.

Now, highlight all and run it. You should get the correlation matrix organized by sex (below).

Correlations

                                                             Highest Year of    Number of
Respondent's Sex                                             School Completed   Children
Male     Highest Year of School     Pearson Correlation                     1       -.182
         Completed                  Sig. (2-tailed)                                  .000
                                    N                                     639         636
         Number of Children         Pearson Correlation                 -.182           1
                                    Sig. (2-tailed)                      .000
                                    N                                     636         638
Female   Highest Year of School     Pearson Correlation                     1       -.282
         Completed                  Sig. (2-tailed)                                  .000
                                    N                                     857         855
         Number of Children         Pearson Correlation                 -.282           1
                                    Sig. (2-tailed)                      .000
                                    N                                     855         857

Now the analysis is conducted separately for each group of the sex variable, which we specified
as the split variable. The number of children is negatively correlated with the highest year of
school completed for both sexes, but the association is stronger for females (-.282, versus
-.182 for males), which is in line with our expectation.
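
If you would rather have one separate table per group instead of a single layered table, the
keyword SEPARATE can be used in place of LAYERED (this corresponds to the “Organize output by
groups” radio button in the dialogue box). A sketch of the whole sequence, written by hand:

```spss
* The file must be sorted by the split variable first.
sort cases by sex.

* SEPARATE produces one output table per group.
split file separate by sex.
correlations /variables = educ childs.

* Turn the split off so later procedures use the whole file.
split file off.
```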

Note that SPLIT FILE is in effect until you explicitly turn it off. Turn it off by running SPLIT
FILE OFF. Let’s run the same correlation and see what we get.

split file off.
correlations educ childs.

Correlations

                                                    Highest Year of    Number of
                                                    School Completed   Children
Highest Year of School     Pearson Correlation                 1.000       -.237
Completed                  Sig. (2-tailed)                                  .000
                           N                                1496.000        1491
Number of Children         Pearson Correlation                 -.237       1.000
                           Sig. (2-tailed)                      .000
                           N                                    1491    1495.000

You can see that SPSS now runs the analysis on the whole data without creating groups.

7. Simple Regression Example

Now, let’s go over a quick and simple regression example. We use the data file “GSS93
subset.sav”. Open this file and make it the active data file.

Suppose we are interested in social and demographic factors that might account for respondents’
socioeconomic status measured by the socioeconomic index (SEI). To answer this research
question, we set forth the following hypotheses and conduct statistical tests.

H1: A respondent’s years of work experience increase his/her SEI, but with diminishing
returns (which implies a quadratic term).
H2: The more cultural capital a respondent has in his/her family of orientation, the higher
his/her SEI is.
H3: The more divided resource allocation is in a respondent’s family of orientation, the
lower his/her SEI is.

Then we decide to operationalize the concepts in those statements in the following way.

(1) Use respondents’ age information to proxy respondents’ years of work experience.
(2) Use mother’s education as an indicator of cultural capital at respondents’ family of
orientation. Specifically, see if it makes a significant difference if their mother received
education of 2-year college or higher.
(3) Use the number of respondents’ siblings to measure resource allocation at their family of
orientation.

(4) Include dummy variables for sex and race as our control variables, where female = 1 and
black = 1, respectively. Both groups are expected to have a lower SEI score on average.

So, first of all, let’s create the new variables needed for the planned analysis. Because we
suspect that the length of work experience has diminishing returns for SEI, we want to create a
quadratic term of age.

compute sqage = age*age .

We also want to create a dummy variable that indicates whether respondents’ mother has
education of 2-year college or higher. As we can see from the Variables button

[Screenshot: variable information for madeg]

So, we need to recode the madeg variable and create a new variable macol.

0 and 1 → 0
2 through 4 → 1
7 through 9 → 9

Then code 9 as this variable’s missing value.

RECODE madeg (0 thru 1=0) (2 thru 4=1) (7 thru 9=9) INTO macol.
VARIABLE LABELS macol 'mother college degree = 1'.
MISSING VALUES macol(9).
VALUE LABELS macol
  0 'College -'
  1 'College +' .

We also recode the variable sex (“Respondent’s sex”) to create a dummy variable female and
race (“Respondent’s race”) to create a dummy variable “black.”

RECODE sex (1=0) (2=1) INTO female.
VARIABLE LABELS female 'female = 1'.

RECODE race (1=0) (3=0) (2=1) INTO black.
VARIABLE LABELS black 'black = 1'.

Now, we have all the variables ready for the analysis.
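
Before running the model, it may be worth checking that the recodes did what we intended. One
simple way, sketched below, is to cross-tabulate each original variable against its new dummy;
every original category should fall into exactly one column. The /MISSING = INCLUDE subcommand
is there so that the categories we declared missing in macol still show up in the table.

```spss
* Check each recode by cross-tabulating the original against the new variable.
crosstabs /tables = madeg by macol /missing = include.
crosstabs /tables = sex by female.
crosstabs /tables = race by black.
```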

Analyze
Regression
Linear…

Select SEI for the dependent variable, and select sqage, age (age), mother’s college education
(macol), number of siblings (sibs), female (female), and race (black) as the independent
variables under the “Block 1 of 1” heading.

Then click Statistics… to bring up the “Linear Regression: Statistics” dialogue box. Check the
“Collinearity diagnostics” box. Click Continue.

Then click Paste.

REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA COLLIN TOL
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT sei
/METHOD=ENTER sqage age macol sibs female black .

REGRESSION is SPSS’s command to run an ordinary linear regression model. What you need
when you write your own syntax are the lines highlighted in gray. The line of the subcommand
/STATISTICS would be unnecessary if you simply wanted the default statistics (i.e.,
coefficients, ANOVA, multiple R [model summary], excluded variables [which are not relevant
here]). We need this line in this case because we ask for COLLIN and TOL, which are both
collinearity diagnostics.
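
Since gray highlighting does not always survive in a plain copy of the syntax, here is a sketch
of what a hand-written version might look like. It assumes that the subcommands left out
(/MISSING LISTWISE, /CRITERIA and /NOORIGIN) can stay at their default settings, which are the
same values SPSS pasted above.

```spss
* Hand-written sketch of the regression with collinearity diagnostics.
regression
  /statistics coeff outs r anova collin tol
  /dependent sei
  /method = enter sqage age macol sibs female black.
```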

Highlight and run those command lines. Here are some of our results.

Model Summary(b)

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .292(a)   .085       .078                17.7582

a. Predictors: (Constant), black = 1, mother college degree = 1, female = 1, sqage, Number of
Brothers and Sisters, Age of Respondent
b. Dependent Variable: Respondent Socioeconomic Index

Coefficients(a)

                                    Unstandardized           Standardized                       Collinearity
                                    Coefficients             Coefficients                       Statistics
Model                               B         Std. Error     Beta           t         Sig.      Tolerance   VIF
1   (Constant)                      32.337    5.364                         6.029     .000
    sqage                           -.007     .003           -.490          -2.693    .007      .036        27.759
    Age of Respondent               .865      .240           .656           3.604     .000      .036        27.760
    mother college degree = 1       6.931     1.615          .149           4.290     .000      .990        1.011
    Number of Brothers and Sisters  -.587     .283           -.072          -2.074    .038      .982        1.018
    female = 1                      -3.955    1.286          -.107          -3.075    .002      .991        1.009
    black = 1                       -6.586    2.557          -.090          -2.576    .010      .977        1.023

a. Dependent Variable: Respondent Socioeconomic Index

As expected, the coefficient on the quadratic age term is negative and statistically
significant, while the coefficient on age itself is positive, which is consistent with
diminishing returns. Mother’s education, a measure of respondents’ cultural capital in their
family of orientation, shows a significant positive impact on respondents’ socioeconomic status.
Having more siblings, on the other hand, seems to reduce the resources available to people and
lead to lower socioeconomic status. Finally, females and African Americans have, on average, a
lower socioeconomic status than males and the other race groups, respectively. The overall
explanatory power of the model is not strong, as indicated by the R Square (.085) under “Model
Summary.”

As for the collinearity diagnostics we tried, tolerance and VIF are inversely related (i.e.,
1/tolerance = VIF) and thus give you the same information. Although there is no definite cut-off
line, a rule of thumb is that VIF > 10 (or tolerance < 0.1) merits further investigation. In our
example, the VIF is high for the two age variables, but this is fully expected since one is the
squared term of the other and hence they are highly collinear by construction. Otherwise, the
tolerance/VIF values all look okay.
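
If the high VIF of the age terms bothers you, one common remedy (not part of the analysis above)
is to center age before squaring it. The sketch below uses DESCRIPTIVES with /SAVE, which adds a
standardized copy of age named Zage by default; sqzage is a hypothetical name chosen for this
example. The coefficients on the age terms will be on a different scale than in the table above,
and the VIF values for them will usually, though not necessarily, be much lower.

```spss
* DESCRIPTIVES /SAVE adds a standardized copy of age, named Zage by default.
descriptives variables = age /save.

* sqzage is a hypothetical name for the squared standardized age.
compute sqzage = Zage * Zage.

* Same model as before, with the standardized age terms swapped in.
regression
  /statistics coeff outs r anova collin tol
  /dependent sei
  /method = enter sqzage Zage macol sibs female black.
```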

Another way to check for collinearity problems is to use the “Collinearity Diagnostics” table
below. The general rule of thumb is that a condition index larger than 30 indicates strong
collinearity. Dimension 7 has a condition index over 30 (41.477), but this is again due to the
age variables, whose variance proportions on that dimension are .97 and 1.00. Otherwise, the
result adds support to the absence of a collinearity problem.

Collinearity Diagnostics(a)

                                                 Variance Proportions
                                    Condition                         Age of       mother college   Number of Brothers
Model   Dimension   Eigenvalue      Index       (Constant)   sqage    Respondent   degree = 1       and Sisters          female = 1   black = 1
1       1           4.317           1.000       .00          .00      .00          .01              .01                  .02          .00
        2           .938            2.145       .00          .00      .00          .00              .00                  .00          .91
        3           .788            2.341       .00          .00      .00          .93              .01                  .00          .00
        4           .434            3.155       .00          .00      .00          .01              .05                  .87          .00
        5           .397            3.299       .00          .01      .00          .00              .61                  .00          .07
        6           .125            5.886       .07          .02      .00          .04              .32                  .11          .00
        7           .003            41.477      .93          .97      1.00         .01              .00                  .00          .00

a. Dependent Variable: Respondent Socioeconomic Index

This is the end of The Very Basics of SPSS. Thanks for playing!

Error Report: Statistical-Computing@sas.upenn.edu
