
Using SPSS for Statistical Analysis

A course for Beginners


by Leo Fernandez

Session II: Describing Data


1 Types of Variables
In statistics, variables describe attributes of the objects being studied. The value of the variable can
'vary' from one entity or sample element to another.
For example, a person's nationality could be a variable if we are studying people. One person
could be "Mexican" and another "Sudanese". Further, if we consider the two entities described
above (a Mexican and a Sudanese), we might also observe some other attributes of these entities.
For example, the Mexican's height could be 5ft 2in and that of the Sudanese, 5ft 10in.
Variables can be grouped under two broad categories: Qualitative vs. Quantitative Variables.

Qualitative: Qualitative variables are also known as "categorical" variables. They describe
attributes of objects by names or labels. A person's religion (e.g., Hindu, Muslim, Christian)
or the colour of the person's eyes (e.g., black, brown, blue) are examples of qualitative or
categorical variables.

Quantitative: Quantitative variables are also known as "numeric" variables. They record a
measurable quantity. For example, when we speak of the population of a city, we are
talking about the number of people in the city - a measurable attribute of the city. Therefore,
population would be a quantitative variable.

In statistical data analysis, variables are of the following types:


Table - 01

Type: Nominal
Category: Categorical
Description: Indicates membership of a collection or category. There is no implied ordering.
Example: Nationality (1 = Australian, 2 = British, 3 = Canadian, 4 = Dane, 5 = Other)

Type: Ordinal
Category: Categorical
Description: Indicates a difference, and indicates the direction of the difference. The items in
the category can be arranged from low to high. Differences between items are not in equal
intervals.
Example: Education (1 = No education, 2 = Primary School, 3 = High School, 4 = Graduate,
5 = Postgraduate)

Type: Interval
Category: Numeric
Description: Indicates a difference with direction. Amounts of difference are in equal intervals.
Example: Age, recorded in whole years

Type: Ratio
Category: Numeric
Description: Indicates a difference with direction. Amounts of difference are in equal intervals.
A zero point is defined.
Example: Income
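
The difference between nominal and ordinal codes can also be illustrated outside SPSS. The following is a minimal Python sketch, not SPSS syntax. The codes are those listed in Table - 01; the dictionary names and the two sample persons are made up for illustration.

```python
# Codes from Table - 01. The dictionary names are illustrative only.
NATIONALITY = {1: "Australian", 2: "British", 3: "Canadian", 4: "Dane", 5: "Other"}  # nominal
EDUCATION = {1: "No education", 2: "Primary School", 3: "High School",
             4: "Graduate", 5: "Postgraduate"}  # ordinal

# Hypothetical coded responses for two people.
person_a = {"nationality": 3, "education": 2}
person_b = {"nationality": 1, "education": 4}

# Ordinal codes carry direction: a higher code means more education.
more_educated = person_b["education"] > person_a["education"]

# Nominal codes are only labels: 3 > 1 says nothing about Canadians versus
# Australians, so the only meaningful comparison is equality.
same_nationality = person_a["nationality"] == person_b["nationality"]

print(EDUCATION[person_b["education"]], "ranks above",
      EDUCATION[person_a["education"]], ":", more_educated)
print("Same nationality:", same_nationality)
```

Note that the sketch never subtracts one education code from another: the gap between codes 1 and 2 need not equal the gap between codes 4 and 5.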

2 SPSS: Reading data into SPSS


The SPSS program has an interface for data entry. We were introduced to that interface in Session
I. When a researcher decides to use SPSS for data analysis, it is more than likely that the data has
already been collected and stored using an office productivity tool like a spreadsheet program.
Data from external sources can be read into SPSS through the following steps:
1. In the SPSS program, navigate to File > Open > Data
A dialogue box will open.
2. In the dialogue box, click on the down arrow against the field named Files of Type.
Choose Excel (*.xls, *.xlsx, *.xlsm)
3. Navigate to the folder containing the Excel file that holds your data and select that file. [Use
the file titanic_ex_II.xlsx that was sent to you.]
4. Click Open
A dialogue box appears.
5. Make sure the check-box is ticked against the label Read Variable names from the first
row of data.
6. Click OK
The Excel file is loaded into SPSS.
Click on the Data View tab at the bottom of the screen.
Voilà! You see the data just as you did in your spreadsheet program.

Click on the Variable View tab at the bottom of the screen.


This screen displays the names of the columns from the imported Excel file and the properties
associated with each column.
You have successfully read an external data source into SPSS.
SPSS can recognize and read data directly from a select list of formats (as can be seen in the
drop-down for the Files of Type field of the File > Open > Data dialogue box).
Now that the data has been imported into SPSS, you can view it in the Data View and
Variable View screens.
The Data View screen displays the data in rows and columns (like a spreadsheet). You can scroll
down the screen to verify that all the data has been correctly imported into the appropriate
columns.
The Variable View screen displays the column names and properties of the data contained in each
column.

3 SPSS: Defining Variables


When you examine the imported data closely, you may notice that the column names are cryptic
(or if you had spaces in the column names of the spreadsheet, the spaces are removed and the
column name is a string of concatenated words). SPSS column names cannot contain spaces or
certain other special characters.
In SPSS, column names are called 'variables'.
It is considered good practice to assign descriptive labels to these variables and define their
properties before proceeding with the analysis of the data.
Defining a variable involves giving it a name, specifying its type, the values the variable can take
(e.g., 1, 2, 3), the scale of measurement and so on.
Variables can be defined in SPSS on either of the following two screens:
1. The Variable View screen
2. The Data > Define Variable Properties screen

1. The Variable View screen


The Variable View screen lists the variables (columns) in the data file and the properties
associated with each of those variables:

Table - 02
Property

Description

Name

The name of the variable. Variable names cannot contain spaces. To
change a variable's name, double-click on the variable that you wish
to re-name. Type your new variable name.

Type

The type of variable. This column describes how the data is stored, the
number of characters it can contain and other formatting information.
This is not to be confused with the types of variables discussed at the
beginning of Session II.
SPSS recognizes the following types:
Numeric, Comma, Dot, Scientific notation, Date, Dollar, Custom
currency, String and Restricted Numeric (integer with leading zeros)
To change a variable's type, click inside the cell corresponding to the
Type column for that variable. A square "..." button will appear; click
on it to open the Variable Type window. Click the option that best
matches the type of variable. Click OK.

Width

The number of digits displayed for numerical values or the number of
characters for a string variable.

Decimals

The number of digits after a decimal point for each value of the
variable (applicable to numeric variables)

Label

A descriptive definition or display name for the variable. The variable
label appears in the output in place of its name (which is often cryptic).
Example: The variable sibsp might be described by the label
"Number of Siblings or Spouse on board".

Value

For coded categorical variables, the value label(s) that should be
associated with each category code. Value labels are useful primarily
for categorical (i.e., nominal or ordinal) variables, especially if they
have been recorded as codes (e.g., 1, 2, 3). It is good practice to give
each value a label so that you (and anyone looking at your data or
results) understand what each value represents.
Example: In the sample dataset, the variable pclass represents the
Passenger Class. The values 1, 2, 3 represent the categories 1st
Class, 2nd Class and 3rd Class, respectively.

Missing

The user-defined values that indicate data are missing for a variable
(e.g., -99). Note that this does not affect or eliminate SPSS's default
missing value code ("."). This column merely allows the user to specify
alternative codes for missing values.

Columns

The width of each column in the Data View spreadsheet.

Align

The alignment of content in the cells of the Data View spreadsheet.

Measure

The level of measurement for the variable (e.g., nominal, ordinal, or
scale).

Role

The role that a variable will play in your analyses (i.e., independent
variable, dependent variable, both independent and dependent). Some
options in SPSS allow you to pre-select variables for particular
analyses based on their defined roles. Any variable that meets the role
requirements will be available for use in such analyses. You can choose
from the following roles for each variable:
- Input: The variable will be used as a predictor (independent
  variable). This is the default assignment for variables.
- Target: The variable will be used as an outcome (dependent
  variable).
- Both: The variable will be used as both a predictor and an
  outcome (independent and dependent variable).
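
To make the Value and Missing properties concrete, here is a minimal Python sketch of what they do conceptually. It is an illustration, not SPSS syntax and not how SPSS implements these properties. The pclass labels and the -99 code come from Table - 02; the raw values and the function name label_value are hypothetical.

```python
# Value labels for pclass, as described in Table - 02.
PCLASS_LABELS = {1: "1st Class", 2: "2nd Class", 3: "3rd Class"}

# Assumption: -99 is the user-defined missing-value code (the example in Table - 02).
MISSING_CODE = -99

# Hypothetical raw values for five cases.
raw_pclass = [1, 3, -99, 2, 3]


def label_value(code):
    """Return the value label for a code, or None if the code marks missing data."""
    if code == MISSING_CODE:
        return None
    return PCLASS_LABELS[code]


labelled = [label_value(code) for code in raw_pclass]
n_missing = labelled.count(None)

print(labelled)
print("Missing cases:", n_missing)
```

The stored data stays numeric. The labels only change how each code is displayed, and the missing code only tells the analysis which cases to leave out.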

2. The Data > Define Variable Properties screen


The Define Variable Properties window is an efficient way of defining many variables at once, or
defining many variables that share the same formatting. Click Data > Define Variable Properties.
Figure - 01

The Define Variable Properties window will open.


Figure - 02

Select the variables you wish to define in the box on the left and click on the blue arrow button.
The selected variables will be moved to the box on the right under the heading 'Variables to
Scan'. The Continue button is now enabled.

Click on Continue.
SPSS will scan the selected variables, identify the properties currently associated with them and
display those properties on a screen where you can view and change them for each variable, as
shown in the following screen.
Figure - 03

On the screen in Figure - 03, select each variable in turn from the scanned variables list and
enter the properties as described in Table - 02.
When you are done describing all the variables, click OK.
ADVANCED:
When you have completed defining the properties of all the variables, instead of clicking on the
OK button, you can click on the Paste button. This will open the SPSS Syntax Editor screen
into which all the SPSS commands used to define the variable properties will be pasted.
You can save this syntax into a file for future use. The next time you have to import your file
into SPSS, you will not need to go through all the steps shown above to define the
variable properties. You can open the syntax file you saved and execute all the commands in it.
The variable properties will then be defined.

Concept Check:
1) Give 3 examples of Nominal variables in the Titanic dataset.
ANSWER:
3) What is the difference between Nominal and Ordinal variables?
ANSWER:
4) List the variables in the Titanic dataset that:
a) Can be placed on a scale of measurement.
ANSWER:
b) Can be considered Ordinal Variables.
ANSWER:
c) Are strings.
ANSWER:
5) Can .docx files be read into SPSS?
ANSWER:

4 Inspecting the data: Frequency Distributions


Before we get on with the analysis of the data, we need to inspect the data in order to:

- spot abnormalities and data entry errors
- observe extreme values (for example, Age could have been entered as 250 in a particular case)
- check whether the data for each variable is within the defined range
- check for missing values
- identify variables that can be recoded into groups (e.g., Fare could be recoded into Low,
  Medium and High)
- get a general feel for the integrity and suitability of the data for further analysis

A useful first step is to use the SPSS Frequencies command, which is available from the menu.
1. Click on Analyze > Descriptive Statistics > Frequencies
2. Select all the variables in the list, except those that simply identify a case: the serial
number of cases or, in the example data set, the Name of Passenger variable (one would expect a
name to be unique to a passenger).
3. Click on the Statistics button
4. In the Frequency statistics window, place a check mark against: Mean, Median, Mode and
any other optional statistic that you may be interested in examining.
5. Click on Continue
6. Click on OK
SPSS opens an Output Window and displays pages of summary statistics and frequency tables
for all the selected variables.


The summary statistics table gives the mean, median and mode for each variable. The mean is
meaningful only for numeric scale variables like Age and Fare. It also shows the number of
missing cases for each variable.
Inspect the frequency distribution table of each variable.
From the frequency tables, it is easy to:
- spot abnormal and extreme values (for example, Age could have been entered as 250 in a
  particular case)
- spot data that is outside the defined range for a variable
- count the cases with missing values (i.e., cases which have no data recorded for the
  variable)
- identify variables that can be recoded into groups (e.g., Fare could be recoded into Low,
  Medium and High)
- get a general feel for the integrity and suitability of the data for further analysis
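
Recoding a scale variable such as Fare into groups amounts to choosing cut points and assigning each value to a band. The following Python sketch shows the idea only; it is not SPSS syntax. The cut points (10 and 50), the function name recode_fare and the sample fares are all assumptions made for illustration and are not taken from the course dataset.

```python
# Assumption: these cut points are arbitrary and chosen only for illustration.
LOW_MAX = 10.0     # fares up to and including 10 count as "Low"
MEDIUM_MAX = 50.0  # fares above 10 and up to 50 count as "Medium"


def recode_fare(fare):
    """Map a numeric fare onto one of three ordered groups."""
    if fare <= LOW_MAX:
        return "Low"
    if fare <= MEDIUM_MAX:
        return "Medium"
    return "High"


# Hypothetical fares, not taken from the course dataset.
fares = [7.25, 71.28, 8.05, 26.0, 512.33, 10.0]
fare_groups = [recode_fare(fare) for fare in fares]
print(fare_groups)
```

In practice the cut points would be chosen after looking at the frequency table or histogram of the variable, so that each group holds a sensible number of cases.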

As you would have observed, for variables measured on a scale (like Age and Fare), the
frequency table could be very long because many cases are likely to have a unique value.
For scale variables, it is more informative to generate descriptive statistics.
1. Go to Analyze > Descriptive Statistics > Descriptives
2. Select the variables Age and Fare
3. Set the Options for the statistics you wish to see
4. Click OK.
We have used the Frequency distribution here to detect wrongly coded variables and to spot
abnormalities and extreme values in the data.
However, the Frequency distribution plays a greater role in statistics. It provides a useful summary
of the data being studied. It is part of a collection of statistics known as Descriptive Statistics,
which are used to describe the data. In particular, the output of the Frequencies command includes
measures of central tendency (the mean, median and mode) and of dispersion (the spread of the
data) for each variable.
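
The quantities reported by the Frequencies command can be reproduced by hand on a small example. The sketch below uses only the Python standard library and a made-up list of ages, where None marks a case with no age recorded. It illustrates the definitions and does not reproduce SPSS output.

```python
from collections import Counter
from statistics import mean, median, mode

# Hypothetical ages; None stands for a case with no age recorded.
ages = [22, 38, 26, None, 35, 22, None, 54, 22, 27]

# Missing cases are counted separately and excluded from the statistics.
valid = [age for age in ages if age is not None]
n_missing = len(ages) - len(valid)

# Frequency table: each distinct value and the number of cases that have it.
frequencies = Counter(valid)
for value in sorted(frequencies):
    print(f"{value:>5} {frequencies[value]:>3}")

print("Valid:", len(valid), "Missing:", n_missing)
print("Mean:", mean(valid))
print("Median:", median(valid))
print("Mode:", mode(valid))
```

With these ten cases, eight are valid and two are missing. The mean is 30.75, the median is 26.5 and the mode is 22.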

Test - 1
Look at the outputs of the Descriptives and Frequencies commands and answer the
following:
1) What is the mean Fare paid by passengers on the Titanic?
ANSWER:
2) What is the mode of the Fares paid by passengers on the Titanic?
ANSWER:
3) How many cases in the Titanic dataset do not have Age entered?
ANSWER:
4) What is the mean Age of passengers on the Titanic?
ANSWER:
5) What is the median Age of passengers on the Titanic?
ANSWER:
6) What is the proportion of passengers on the Titanic who survived?
ANSWER:
7) How many passengers on the Titanic did not pay any fare?
ANSWER:

5 SPSS: Histograms
While the Frequency distribution displays a table of numbers that summarizes the distribution of
values of each variable, showing how the values are spread from minimum to maximum, the
Histogram provides a graphical representation of the distribution.
In SPSS, histograms are produced from the same menu option that produced frequency tables.
1. Click on Analyze > Descriptive Statistics > Frequencies
2. Select the variables for which you want to produce histograms (select Age and Pclass as
an example)
3. At the bottom of the variable select screen, uncheck the check-box against the label
Display Frequency Tables
4. Click on the Charts button
5. Select the radio button Histograms
6. Click on Continue
7. Click on OK
The histogram will be displayed in the currently open SPSS output window.
Figure - 04
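
A histogram is a frequency table computed over intervals (bins) of values rather than over individual values. The Python sketch below draws a text histogram to show the idea. The ages and the bin width of 10 years are assumptions for illustration; SPSS chooses its own bins and draws a proper chart.

```python
from collections import Counter

# Hypothetical ages; the bin width of 10 years is an assumption.
ages = [2, 19, 22, 24, 27, 29, 31, 35, 38, 41, 54, 58, 63]
BIN_WIDTH = 10

# Assign each value to the lower edge of its bin, e.g. 27 falls in the bin starting at 20.
bin_counts = Counter((age // BIN_WIDTH) * BIN_WIDTH for age in ages)

# Draw one row per bin, including any empty bins between the extremes.
for lower in range(min(bin_counts), max(bin_counts) + BIN_WIDTH, BIN_WIDTH):
    count = bin_counts.get(lower, 0)
    print(f"{lower:>2}-{lower + BIN_WIDTH - 1:<2} | {'#' * count} ({count})")
```

Changing BIN_WIDTH changes the shape of the picture, which is why it is worth trying more than one bin width before drawing conclusions from a histogram.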

6 Correcting and Cleaning Data


The process of inspecting the data through frequency distributions and histograms often reveals
input errors and other problems with the data. The errors identified in the previous section need to
be corrected before proceeding with analysis.
What are these errors that we are talking about and how do we correct them if we find such errors?
Typical examples of data errors could be:

- incorrect coding of values
- typing mistakes
- shifting of data from one column into the neighboring column
- outliers or extreme values

Data cleaning typically takes up a large share of the time spent on data analysis. It is nevertheless
a very important step because erroneous data can lead to erroneous conclusions.
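
Finding the cases that contain such errors is essentially a range check on each variable. The following Python sketch shows the logic on four made-up cases. The limits in VALID_RANGES, the function name find_errors and the data are all assumptions for illustration. In the lab exercise itself the same checks are done by reading the SPSS output and the Data View.

```python
# Assumption: plausible limits for each variable; adjust these to your own data.
VALID_RANGES = {"age": (0, 110), "pclass": (1, 3)}

# Hypothetical cases: case 3 has a typing mistake in age, case 4 an invalid code.
cases = [
    {"case": 1, "age": 29, "pclass": 1},
    {"case": 2, "age": 41, "pclass": 3},
    {"case": 3, "age": 250, "pclass": 2},
    {"case": 4, "age": 35, "pclass": 7},
]


def find_errors(rows, ranges):
    """Return (case number, variable, value) for every out-of-range value."""
    errors = []
    for row in rows:
        for variable, (low, high) in ranges.items():
            value = row[variable]
            if not low <= value <= high:
                errors.append((row["case"], variable, value))
    return errors


errors = find_errors(cases, VALID_RANGES)
for case_number, variable, value in errors:
    print(f"Case {case_number}: {variable} = {value} is outside the defined range")
```

A range check only flags candidates. Whether a flagged value is a typing mistake or a genuine extreme value still has to be decided by going back to the source of the data.
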
This session will be conducted as a hands-on exercise under supervision, according to the
following instructions.
Lab Exercise: Correcting and Cleaning Data
1. Read the supplied data file: titanic_ex_II.csv
2. Re-run the commands used in Section 4 - Inspecting the data
3. Inspect the outputs produced.
4. Make a list of the errors identified in the outputs.
5. Identify the cases which have these errors.
6. Correct the errors using the data editor.
7. Re-run the commands used in Section 4 to confirm that the errors have been rectified.
8. Save the data file.

Session II: Homework Exercise:


1. Read the data from the file body.csv into SPSS. Study the accompanying file body.txt
which provides information about the dataset.
2. The article associated with this data set appears in the Journal of Statistics Education,
Volume 11, Number 2 (July 2003). Read this article here:
http://www.amstat.org/publications/jse/v11n2/datasets.heinz.html
3. Once the data has been read into SPSS, assign meaningful variable labels and value
labels, using the information provided in the file body.txt.
4. Produce frequency tables, histograms and box plots from this dataset.

OR
1. Read the data from the file cafedata.xls into SPSS. Study the accompanying file
cafedata_documentation.txt which provides information about this dataset.
2. The article associated with this data set appears in the Journal of Statistics Education,
Volume 19, Number 1 (March 2011) issue. Read this article here:
http://www.amstat.org/publications/jse/v19n1/depaolo.pdf
3. Once the data has been read into SPSS, assign meaningful variable labels and value
labels.
4. Produce some frequency tables and histograms.

Online Resources:
1. https://statistics.laerd.com/statistical-guides/types-of-variable.php
2. https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php
