
Important Statistical Terms for SPSS

Statistics
Statistics - a set of concepts, rules, and
procedures that help us to:
organize numerical information in the form of
tables, graphs, and charts;
understand statistical techniques underlying
decisions that affect our lives and well-being;
and
make informed decisions.
Types of Data
Data - facts, observations, and information that come from investigations.
Measurement data (sometimes called quantitative data) - the result of using some instrument to measure something (e.g., test score, weight).
Categorical data (also referred to as frequency or qualitative data) - things are grouped according to some common property(ies) and the number of members of the group is recorded (e.g., males/females, vehicle type).
Types of Variables
Variable - property of an object or event that can take on
different values. For example, college major is a variable that
takes on values like mathematics, computer science, English,
psychology, etc.
Discrete Variable - a variable with a limited number of values (e.g., gender (male/female), college class (freshman/junior/senior)).
Continuous Variable - a variable that can take on many
different values, in theory, any value between the lowest and
highest points on the measurement scale.
Independent Variable - a variable that is manipulated,
measured, or selected by the researcher as an antecedent
condition to an observed behavior. In a hypothesized cause-
and-effect relationship, the independent variable is the cause
and the dependent variable is the outcome or effect.
Dependent Variable - a variable that is not under the
experimenter's control -- the data. It is the variable that is
observed and measured in response to the independent
variable.
Qualitative Variable - a variable based on categorical data.
Quantitative Variable - a variable based on quantitative
data.
Types of Graphs
Graphs - visual display of data used to present frequency distributions
so that the shape of the distribution can easily be seen.
Bar graph - a form of graph that uses bars separated by an arbitrary
amount of space to represent how often elements within a category
occur. The higher the bar, the higher the frequency of
occurrence. The underlying measurement scale is discrete, not
continuous.
Histogram - a form of a bar graph used with interval or ratio-scaled
data. Unlike the bar graph, bars in a histogram touch with the width of
the bars defined by the upper and lower limits of the interval. The
measurement scale is continuous, so the lower limit of any one interval
is also the upper limit of the previous interval.
Boxplot - a graphical representation of dispersion and extreme
scores. Represented in this graphic are minimum, maximum, and
quartile scores in the form of a box with "whiskers." The box includes
the range of scores falling into the middle 50% of the distribution
(interquartile range = 75th percentile - 25th percentile), and the
whiskers are lines extended to the minimum and maximum scores in
the distribution or to mathematically defined (+/-1.5*IQR) upper and
lower fences.
Scatterplot - a form of graph that presents information from a
bivariate distribution. In a scatterplot, each subject in an experimental
study is represented by a single point in two-dimensional space. The
underlying scale of measurement for both variables is continuous
(measurement data). This is one of the most useful techniques for
gaining insight into the relationship between two variables.
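To make the graph types concrete, here is a minimal Python/matplotlib sketch (not part of the original notes) that draws each of the four graph types; the data sets and variable names are invented for illustration.

```python
# Minimal sketch of the four graph types, using invented sample data.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(70, 10, 200)             # made-up measurement data
counts = {"car": 12, "truck": 7, "bike": 4}  # made-up categorical counts

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

# Bar graph: separated bars on a discrete (categorical) scale
axes[0, 0].bar(list(counts.keys()), list(counts.values()))
axes[0, 0].set_title("Bar graph")

# Histogram: touching bars over a continuous measurement scale
axes[0, 1].hist(scores, bins=10)
axes[0, 1].set_title("Histogram")

# Boxplot: box over the middle 50% (IQR), whiskers out to the 1.5*IQR fences
axes[1, 0].boxplot(scores)
axes[1, 0].set_title("Boxplot")

# Scatterplot: each observation is one point in two-dimensional space
x = rng.uniform(0, 10, 50)
y = 2 * x + rng.normal(0, 3, 50)
axes[1, 1].scatter(x, y)
axes[1, 1].set_title("Scatterplot")

plt.tight_layout()
plt.show()
```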
Measures of Center
Measures of Center - Plotting data in a
frequency distribution shows the general
shape of the distribution and gives a general
sense of how the numbers are
bunched. Several statistics can be used to
represent the "center" of the
distribution. These statistics are commonly
referred to as measures of central
tendency.
Mean is the average of a set of data. To calculate the mean, find the sum of the data and then divide by the number of data values.
12, 15, 11, 11, 7, 13
First, find the sum of the data.
12 + 15 + 11 + 11 + 7 + 13 = 69
Then divide by the number of data values.
69 / 6 = 11.5
The mean is 11.5.


Median is the middle
number in a set of data
when the data is
arranged in numerical
order.
12, 15, 11, 11, 7, 13
First, arrange the data in numerical
order. 7, 11, 11, 12, 13, 15
Then find the number in the middle
or the average of the two numbers
in the middle.
11 + 12 = 23, and 23 / 2 = 11.5
The median is 11.5
Example: find the median price in a set of prices.
First, place the prices in numerical order:
$100, $275, $300, $325, $350, $375, $500
The price in the middle is the median price.
The median price is $325.


The mode is the
number that occurs the
most.
12, 15, 11, 11, 7, 13

The mode is 11.


Sometimes a set of data will have more than one mode.
For example, in the following set both the numbers 5 and 7 appear twice.
2, 9, 5, 7, 8, 6, 4, 7, 5
5 and 7 are both modes, and this set is said to be bimodal.
Sometimes there is no mode in a set of data.
3, 8, 7, 6, 12, 11, 2, 1
All the numbers in this set occur only once; therefore there is no mode in this set.
Mean - the average
Median - the number (or the average of the two numbers) in the middle
Mode - the number that occurs most often
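As a quick check of the worked examples above, this short Python sketch (standard library only) computes the three measures for the same data set; the variable names are my own.

```python
# Mean, median, and mode of the example data set 12, 15, 11, 11, 7, 13.
import statistics

data = [12, 15, 11, 11, 7, 13]

print(statistics.mean(data))    # (12+15+11+11+7+13) / 6 = 11.5
print(statistics.median(data))  # middle of sorted data: (11+12) / 2 = 11.5
print(statistics.mode(data))    # most frequent value: 11

# For the bimodal example, multimode returns every mode:
print(statistics.multimode([2, 9, 5, 7, 8, 6, 4, 7, 5]))  # [5, 7]
```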
Measures of Skewness and Kurtosis
A fundamental task in many statistical analyses is to characterize the location and variability of a data set (measures of central tendency vs. measures of dispersion).
Neither of these measures tells us anything about the shape of the distribution.
A further characterization of the data includes skewness and kurtosis.
The histogram is an effective graphical technique for showing both the skewness and kurtosis of a data set.
Further Moments of the Distribution
There are further statistics that describe the shape of the distribution, using formulae that are similar to those of the mean and variance:
1st moment - Mean (describes central value)
2nd moment - Variance (describes dispersion)
3rd moment - Skewness (describes asymmetry)
4th moment - Kurtosis (describes peakedness)
Further Moments - Skewness
Skewness measures the degree of asymmetry exhibited by the data:

$$\text{skewness} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^3}{n s^3}$$

If skewness equals zero, the histogram is symmetric about the mean.
Positive skewness vs. negative skewness.
Further Moments - Skewness
[Figure: examples of positively and negatively skewed distributions]
Source: http://library.thinkquest.org/10030/3smodsas.htm

Further Moments - Skewness
Positive skewness
There are more observations below the mean than above it.
The mean is greater than the median.

Negative skewness
There are a small number of low observations and a large number of high ones.
The median is greater than the mean.
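To make the formula concrete, here is a small Python sketch with an invented, right-skewed data set; it assumes s in the formula is the population standard deviation, since the notes do not specify which version is meant.

```python
# Skewness per the formula above: sum of cubed deviations over n * s**3.
import numpy as np

x = np.array([2.0, 3.0, 3.0, 4.0, 4.0, 4.0, 5.0, 9.0])  # invented, right-skewed data
n = x.size
mean = x.mean()
s = x.std()  # population standard deviation (ddof=0); an assumption

skewness = np.sum((x - mean) ** 3) / (n * s ** 3)
print(skewness)  # positive here, consistent with a long right tail (mean > median)
```

With these (biased, population-moment) choices the result should agree with scipy.stats.skew(x) at its default settings.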
Further Moments - Kurtosis
Kurtosis measures how peaked the histogram is:

$$\text{kurtosis} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^4}{n s^4} - 3$$

The kurtosis of a normal distribution is 3; because the formula above subtracts 3 (giving excess kurtosis), a normal distribution scores 0 on this measure.
Kurtosis characterizes the relative peakedness or flatness of a distribution compared to the normal distribution.
Further Moments - Kurtosis
[Figure: platykurtic vs. leptokurtic distributions]
Source: http://www.riskglossary.com/link/kurtosis.htm

Kurtosis is based on the size of a distribution's tails.
Negative kurtosis (platykurtic) - distributions with short tails.
Positive kurtosis (leptokurtic) - distributions with relatively long tails.
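In the same spirit, here is a short sketch computing kurtosis as defined above (fourth-moment ratio minus 3) on roughly normal simulated data; again s is assumed to be the population standard deviation.

```python
# Excess kurtosis per the formula above: fourth-power deviations over n * s**4, minus 3.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)  # simulated roughly-normal data, so expect a value near 0

n = x.size
mean = x.mean()
s = x.std()  # population standard deviation (ddof=0); an assumption

kurtosis = np.sum((x - mean) ** 4) / (n * s ** 4) - 3
print(kurtosis)  # near 0 for normal data; < 0 platykurtic, > 0 leptokurtic
```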
Correlation
A correlation is a relationship between two variables. The
data can be represented by the ordered pairs (x, y) where
x is the independent (or explanatory) variable, and y is
the dependent (or response) variable.
A scatter plot can be used to determine whether a linear (straight line) correlation exists between two variables.
Example:
x: 1, 2, 3, 4, 5
y: -4, -2, -1, 0, 2
[Scatter plot of the example (x, y) pairs]
Linear Correlation
[Figure: four example scatter plots]
Negative linear correlation: as x increases, y tends to decrease.
Positive linear correlation: as x increases, y tends to increase.
No correlation.
Nonlinear correlation.
Correlation Coefficient
The correlation coefficient is a measure of the strength and the direction of a linear relationship between two variables. The symbol r represents the sample correlation coefficient. The formula for r is

$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}.$$

The range of the correlation coefficient is -1 to 1. If x and y have a strong positive linear correlation, r is close to 1. If x and y have a strong negative linear correlation, r is close to -1. If there is no linear correlation or a weak linear correlation, r is close to 0.
Example: Correlation Coefficient
Calculate the correlation coefficient r for the following data.

 x    y    xy    x²    y²
 1   -3    -3     1     9
 2   -1    -2     4     1
 3    0     0     9     0
 4    1     4    16     1
 5    2    10    25     4

Σx = 15, Σy = -1, Σxy = 9, Σx² = 55, Σy² = 15

$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} = \frac{5(9) - (15)(-1)}{\sqrt{5(55) - 15^2}\,\sqrt{5(15) - (-1)^2}} = \frac{60}{\sqrt{50}\sqrt{74}} \approx 0.986$$

There is a strong positive linear correlation between x and y.
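As a cross-check of the hand calculation, this sketch computes r for the same five (x, y) pairs with NumPy and should print roughly 0.986.

```python
# Correlation coefficient for the worked example.
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([-3, -1, 0, 1, 2])

r = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 correlation matrix
print(r)  # about 0.986 -> strong positive linear correlation
```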
If r = 0, this means no association or correlation between the two variables.
If 0 < r < 0.25: weak correlation.
If 0.25 ≤ r < 0.75: intermediate correlation.
If 0.75 ≤ r < 1: strong correlation.
If r = 1: perfect correlation.
Regression Analyses
Regression: a technique concerned with predicting some variables by knowing others.
The process of predicting variable Y using variable X.
Uses a variable (x) to predict some outcome variable (y).
Tells you how values in y change as a function of changes in values of x.
Correlation and Regression
Correlation describes the strength of a linear relationship between two variables.
Linear means straight line.
Regression tells us how to draw the straight line described by the correlation.
Regression
Calculates the best-fit line for a certain set of data
The regression line makes the sum of the squares of
the residuals smaller than for any other line
Regression minimizes residuals
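As an illustration, the following sketch fits the least-squares (best-fit) line to the correlation example data with NumPy; the printed slope and intercept describe the line that makes the sum of squared residuals as small as possible.

```python
# Least-squares regression line y = b0 + b1 * x for the example data.
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([-3, -1, 0, 1, 2])

b1, b0 = np.polyfit(x, y, deg=1)   # slope and intercept of the best-fit line
print(f"y = {b0:.2f} + {b1:.2f} * x")

# Residuals: observed y minus the value predicted by the line.
residuals = y - (b0 + b1 * x)
print((residuals ** 2).sum())      # the quantity the regression line minimizes
```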
[Scatter plot: systolic blood pressure, SBP (mmHg), vs. weight, Wt (kg)]
