IHP-340: Basic Tools to Describe Data

Basic Tools to Describe Data:


When describing a set of data, it is useful to characterize the middle (measures of central
tendency) of the data set and how spread out those data are (measures of variability).
Measures of Central Tendency:
What is a measure of central tendency? The best way to explain this is by giving an example.
You have taken an exam, and the instructor hands you back your graded exam; you received
a 57% on it. One of the first things you would ask the instructor, besides any content-related
questions, is: "How did the class do overall?" What response are you expecting? Maybe he or
she will give you the average score or maybe give you the highest grade achieved. Is there
one number that can be reported to you that describes or represents the set of exam grades?
You probably will not feel so bad if the average was 45% or if you received the highest score.
In this section, we will talk about what numbers best describe the "center" of a set of data. We
will also discuss (in the section below) numbers we use to describe how spread out or variable
those data are.
The Mean: You probably know this as the average, and it is the most commonly used measure of
central tendency. You calculate this by adding up the individual scores and then dividing by the
number of scores there were. There are situations where the mean can give you misleading
information about the data; we will talk about this below.
The Median: The median is the score that sits at the exact middle position of a set of data. It
is also known as the 50th percentile, where half of the scores fall above it and half fall below
it. To find the median, you would arrange the scores in ascending (or descending) order,
count the number of scores, and find the number that is in the middle. When there is an odd
number of scores, the median is very easy to find. When there is an even number of scores,
you find the average of the two scores that are in the middle. You will see an example later.
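The steps for finding the median can be sketched in Python. This is only an illustration: the function name `median_by_position` and both score lists are made up for this sketch and are not part of the course materials.

```python
def median_by_position(scores):
    """Return the median by sorting and picking the middle position."""
    ordered = sorted(scores)          # arrange the scores in ascending order
    n = len(ordered)                  # count the number of scores
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]        # odd count: the single middle score
    # even count: average the two scores in the middle
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_scores = [51, 34, 42, 57, 35]         # hypothetical, 5 scores
even_scores = [51, 34, 42, 57, 35, 53]    # hypothetical, 6 scores

print(median_by_position(odd_scores))     # 42
print(median_by_position(even_scores))    # 46.5 (average of 42 and 51)
```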
The Mode: The mode is probably the least useful measure of central tendency. It is the most
frequently occurring score in a set of data. There can be more than one mode if two or more
scores tie for the highest frequency. If there are two modes, the distribution is called
bimodal.
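As a small sketch, Python's standard library can list every mode in a data set with `statistics.multimode` (available in Python 3.8 and later). Both score lists here are hypothetical and chosen only to show one mode versus two.

```python
from statistics import multimode

one_mode = [34, 35, 42, 42, 51, 53, 57]     # hypothetical: 42 occurs twice
two_modes = [35, 35, 42, 42, 51, 53, 57]    # hypothetical: 35 and 42 tie

print(multimode(one_mode))     # [42]
print(multimode(two_modes))    # [35, 42], so this set is bimodal
```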
Example: Consider the following data from an exam: Exam scores: 34, 35, 42, 42, 51, 53, 57.
In this scenario, the mean is about 45% (adding the numbers gives you 314, and there are 7
scores, so 314/7 ≈ 45%). The mode is 42 because it occurred the most frequently. The
median is 42 because it sits in the middle position. What would happen if you had
received a 100% instead of a 57%? The average would now be 51%. Just changing one
person's score pulled the mean up by 6 percentage points. Did the median change? No. Why
not? Because the median is based on position and the number of scores; since the number
of scores did not change when we changed your score from a 57% to a 100%, and the new
score still falls above the middle position, the median did not change. So, the mean
utilizes the value of every score, which most of the time is a benefit, except when there
are extreme scores present in the data. The median does not utilize the value of every
number; it simply depends on the number of scores and the value of the one score that
falls in the center. Because it does not consider the values of all the scores, it is not
as sensitive a measure of central tendency, but it is advantageous when there are extreme
scores present in the data.
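The arithmetic in this example can be checked with a short Python sketch using the standard `statistics` module. The variable names are illustrative only.

```python
from statistics import mean, median, mode

scores = [34, 35, 42, 42, 51, 53, 57]
print(sum(scores), len(scores))        # 314 7
print(round(mean(scores)))             # 45
print(median(scores), mode(scores))    # 42 42

# Replace the 57 with a 100 and compare.
with_outlier = [34, 35, 42, 42, 51, 53, 100]
print(round(mean(with_outlier)))       # 51 (the mean moves up)
print(median(with_outlier))            # 42 (the median stays put)
```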

Let's expand the example. Let's say there were more people in the class and the exam
scores were as follows: one 34, three 35s, six 42s, seven 47s, five 51s, four 53s, one 57,
and two 100s.
The average is 50% but everyone is looking at each other's scores, and it does not
seem to make sense. Out of the 29 scores, 17 people scored below the mean, and 5
more people scored just slightly above the mean. It seems that the middle should
be lower. Is 50% really describing the set of scores? The median in this case, 47, is a
better representation of the central tendency of these data. You can keep going with
this illustration. Change the 57 to another 100 and see what happens to the mean
while the median remains at 47.
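To try this illustration yourself, here is a minimal Python sketch that builds the 29 scores from their counts and recomputes the mean and median. The dictionary `counts` and the other names are illustrative, not part of the course materials.

```python
from statistics import mean, median

# score -> number of students who received it (from the expanded example)
counts = {34: 1, 35: 3, 42: 6, 47: 7, 51: 5, 53: 4, 57: 1, 100: 2}
scores = [score for score, n in counts.items() for _ in range(n)]

class_mean = mean(scores)
below_mean = sum(s < class_mean for s in scores)

print(len(scores))             # 29
print(round(class_mean, 1))    # 49.8, reported as 50% in the text
print(median(scores))          # 47
print(below_mean)              # 17

# Change the single 57 to another 100.
changed = [100 if s == 57 else s for s in scores]
print(round(mean(changed), 1)) # 51.3
print(median(changed))         # 47
```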
Measures of Variability:
While it is very useful to know where the center of the data is, we are also often interested in
the distribution or the spread of the scores around that center. In the example above using
exam scores, the spread of scores showed an interesting pattern. The majority of the class
scored between 34 and 57 and two students scored a perfect 100%. Remember that in this
case the average was 50%, and the median was 47%. Can you imagine that there could be a
wide range of distributions that would still result in a mean of 50%? There could have been
two zeros and two 100s, which would have given a mean of 50%, or every student could have
scored 50%, or there could have been four 45s and four 55s, etc. Can you see that the
"center" of the data describes a quality that is different than the spread or dispersion of the
scores? How can we describe variability?
The Range: The range is a very simple way of describing the spread of data. It is simply the
highest score minus the lowest score. You can see that in the examples given in the paragraph
above, the ranges would be very different: 100 - 34 = 66; 100 - 0 = 100; 50 - 50 = 0; 55 - 45 = 10.
This tells us something about the spread of scores but relies on only two scores, the highest
and the lowest.
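A brief Python sketch shows how data sets with the same mean can have very different ranges. The function name `score_range` is illustrative, and the "everyone scores 50" list assumes a class of four students purely for the sake of the example.

```python
def score_range(scores):
    """Range = highest score minus lowest score."""
    return max(scores) - min(scores)


examples = {
    "two zeros and two 100s": [0, 0, 100, 100],
    "everyone scores 50": [50, 50, 50, 50],      # class size of 4 is assumed
    "four 45s and four 55s": [45] * 4 + [55] * 4,
}

for label, scores in examples.items():
    average = sum(scores) / len(scores)
    print(label, average, score_range(scores))
# two zeros and two 100s 50.0 100
# everyone scores 50 50.0 0
# four 45s and four 55s 50.0 10
```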
Standard Deviation: The standard deviation is a calculation that uses every score to
describe the variability. In this calculation, the mean is subtracted from every score to get
that score's distance from the middle of the data. There are two aspects of the standard deviation
that are important: the calculation and the interpretation. The calculation is not hard once
you understand some symbols used in statistics. The interpretation requires that you
understand the concept of the normal curve.
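One common way to carry out the calculation is sketched below in Python, using the seven exam scores from the earlier example. This handout does not say whether the course divides by n (all scores treated as the whole population) or by n - 1 (scores treated as a sample), so the sketch shows both as an assumption; the variable names are illustrative.

```python
import math
from statistics import pstdev, stdev

scores = [34, 35, 42, 42, 51, 53, 57]
n = len(scores)
m = sum(scores) / n                       # the mean

deviations = [x - m for x in scores]      # each score's distance from the mean
squared = [d ** 2 for d in deviations]    # square so negatives do not cancel
sum_of_squares = sum(squared)

population_sd = math.sqrt(sum_of_squares / n)        # divide by n
sample_sd = math.sqrt(sum_of_squares / (n - 1))      # divide by n - 1

print(round(population_sd, 2))    # 8.31
print(round(sample_sd, 2))        # 8.97

# The standard library gives the same answers.
print(math.isclose(population_sd, pstdev(scores)))   # True
print(math.isclose(sample_sd, stdev(scores)))        # True
```

Squaring either result gives the variance.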
Interpreting Measures of Central Tendency and Variability:
In order to interpret measures of central tendency and variability, you need to understand
the normal curve. Many of you could probably draw a normal curve, but could you describe
what it represents or identify any of its major characteristics? A normal curve is a
frequency distribution of a particular variable. When enough people are included in this
frequency distribution, the curve that results becomes more and more "normal." Normal
means that the curve is bell-shaped and symmetric, with the majority of scores clustering
at the center and fewer scores at either end.

Refer to the Normal Curve lecture and see examples of how these curves can vary. Skewness
refers to whether the distribution is shifted toward the high or low end, leaving a long tail on
the other side. This means that some people scored at one extreme while the majority of people's
scores clustered around a point at the opposite end. Kurtosis describes another characteristic of
the curve, which is how peaked or flat the curve is.

Before we talk about the importance of the normal curve, let's make sure we all know what
we are looking at. The normal curve is a graph of a particular variable. Let's say that it is
the body weight of a group of people. The x-axis (horizontal axis) shows the range of scores
with M being the mean. The height of the curve represents the number of people who received each
score, so the y-axis (vertical axis) is the number of people (the frequency). Instead of a general
picture of a normal curve, let me show you one using real data. The graph below shows the body
weight of 240 men measured as part of a cardiovascular health study. You can see that this is a
bar graph that shows how many people fell within each small range (each bar) of body weights. Can
you imagine a normal curve drawn over this graph? If it were skewed, in which direction would
it be?

Central Tendency, Variability and the Normal Curve:


Let's use the graph of body weight above as an example. The peak of this curve should show
the most frequently occurring score, which is the mode, if the bars on this bar graph were
skinny enough. The mean will fall where the scores cluster, and the median will be the score
that holds the middle position. If the curve were perfectly bell-shaped and symmetrical, or
perfectly normal, the mean, mode, and median would fall in the same spot. If there are
extreme scores at one end of the graph, the mean would be pulled toward the extreme
scores, while the median would be largely unaffected; this would result in a skewed curve. A
curve with a long tail toward the positive side is termed a positive skew, and a long tail
toward the negative side is termed a negative skew.
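Because extreme scores pull the mean but not the median, comparing the two gives a rough hint about the direction of skew. The Python sketch below is only a rule of thumb under that assumption, not a formal skewness statistic; the function name `pull_direction` and the second and third score lists are hypothetical.

```python
from statistics import mean, median

def pull_direction(scores):
    """Report which way the mean is pulled relative to the median."""
    gap = mean(scores) - median(scores)
    if gap > 0:
        return "positive"    # mean pulled toward high scores
    if gap < 0:
        return "negative"    # mean pulled toward low scores
    return "none"            # mean and median coincide


print(pull_direction([34, 35, 42, 42, 51, 53, 100]))   # positive
print(pull_direction([0, 35, 42, 42, 51, 53, 57]))     # negative
print(pull_direction([40, 45, 50, 55, 60]))            # none
```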
Let's go back to the general normal curve shown earlier. Notice that on either side of the
mean the x-axis is labeled with positive 1, 2, 3 and negative 1, 2, and 3. These are
standard deviations, and the interpretation of a standard deviation relies on the normal curve.
When scores are normally distributed, the interval given by adding one standard deviation to the
mean and subtracting one standard deviation from the mean includes about the middle 68% of the
scores in the sample. If the data are more spread out, the standard deviation is larger, so it
takes a wider interval to include the middle 68% of the scores. If the data are more clustered, it
takes a narrower interval to include 68% of the scores. You can add and subtract to get 2 (about
95% of scores) and 3 (about 99.7% of scores) standard deviations around the mean. Notice that it is
very rare for anyone to score more than 3 standard deviations from the mean because 3 standard
deviations include about 99.7% of the scores.
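The share of scores within 1, 2, and 3 standard deviations of the mean on a perfectly normal curve can be checked numerically. This sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 and later); the function name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Percent of a normal distribution within k standard deviations of the mean."""
    return 100 * (standard_normal.cdf(k) - standard_normal.cdf(-k))


for k in (1, 2, 3):
    print(k, round(share_within(k), 1))
# 1 68.3
# 2 95.4
# 3 99.7
```

Real data are rarely perfectly normal, so the percentages in an actual sample will only be close to these values.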
The Importance of the Normal Curve:
A phenomenon that can be regularly observed is that when you measure many variables,
if you measure enough people, the scores tend to take the shape of a normal distribution
if graphed. There tends to be clustering around a certain point and then variability around that
so that there are fewer and fewer people who fall at either extreme. In many ways, this
occurrence is the basis for statistics. We will revisit this concept over and over throughout the
semester in various ways (which is good news because if you are struggling now you will have
more chances to understand as we progress).
Many of the statistical tests you will learn about in this course will be based on the assumption
that the data you are dealing with are normally distributed.
Vocabulary:
The following definitions are from: Thomas, J.R., Nelson, J.K., & Silverman, S.J. (2005). Research
Methods in Physical Activity, (5th ed.). Champaign, IL: Human Kinetics.
Central tendency (measure of): a single score that best represents all of the scores
Frequency distribution: a distribution of scores including the frequency with which they
occur
Frequency intervals: small ranges of scores within a frequency distribution into which
scores are grouped
Inference: generalization of results to some larger group.

Kurtosis: description of the vertical characteristic of the curve showing the data
distribution, for example, whether the curve is more peaked or flatter than the normal
curve

Mean: a statistical measure of central tendency that is the average score of a group
of scores
Median: a statistical measure of central tendency that is in the middle in a
group
Mode: a statistical measure of central tendency that is the most frequently occurring
score of the group
Normal curve: distribution of data (a frequency distribution) in which the mean, median,
and mode are at the same point (center of distribution) and in which ±1s from the mean
includes 68% of scores, ±2s from the mean includes 95% of the scores, and ±3s includes 99%
of the scores
Population: the larger group from which a sample is taken
Sample: a group of participants, treatments, or situations selected from a larger
population
Sampling Distribution: a frequency distribution of the scores of a variable measured
in many samples selected from the same population
Skewness: description of the direction of the hump of the curve of distribution of data
and the nature of the tails of the curve
Standard deviation: an estimate of the variability of the scores of a group around
the mean
Standard error: the variability of a sampling distribution
Variability: the degree of difference between each individual score and the central
tendency score
Variance: the square of the standard deviation
