
METHODS FOR DESCRIBING SETS OF DATA

Two methods for describing data are presented in this lecture,
one graphical and the other numerical.
DESCRIBING QUALITATIVE DATA
• Qualitative data are non-numerical in nature;
thus, the value of a qualitative variable can only
be classified into categories called classes.
• We can summarize such data numerically in two
ways:
(1) by computing the class frequency, the number of
observations in the data set that fall into each class;
or
(2) by computing the class relative frequency, the
proportion of the total number of observations falling
into each class.
Class
• A class is one of the categories into which
qualitative data can be classified.
• The class frequency is the
number of observations
in the data set falling into
a particular class.
Class relative frequency

• The class relative frequency is the class
frequency divided by the total number of
observations in the data set.
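The two summaries can be sketched in a few lines of Python. The class labels and observations below are hypothetical, made up for illustration only; they are not data from the lecture.

```python
from collections import Counter

# Hypothetical qualitative observations (illustrative only).
observations = ["A", "B", "A", "C", "B", "A", "D", "A", "B", "C"]

# Class frequency: number of observations falling into each class.
class_frequency = Counter(observations)

# Class relative frequency: class frequency / total number of observations.
n = len(observations)
class_relative_frequency = {c: f / n for c, f in class_frequency.items()}

for c in sorted(class_frequency):
    print(f"{c}: frequency={class_frequency[c]}, "
          f"relative frequency={class_relative_frequency[c]:.2f}")
```

The relative frequencies always sum to 1, which is a quick check on the arithmetic.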
Relative frequency
• The relative frequency is obtained
by dividing the class
frequency by the total
number of
observations in the
data set.
• Thus, the relative
frequencies for the
four degree types are:
Graphical methods for describing
qualitative data
• The most widely used graphical methods for
describing qualitative data are bar graphs and pie
charts.
Dot Plots
• The numerical value of each measurement in
the data set is located on the horizontal scale by
a dot. When data values repeat, the dots are
placed above one another, forming a pile at that
particular numerical location.
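A rough text-only sketch of this construction is given below. The five values are illustrative, and the helper name `dot_plot` is hypothetical; the layout assumes small single-digit integers so that the columns line up.

```python
from collections import Counter

def dot_plot(values):
    """Return the rows of a text dot plot, top row first, axis last."""
    counts = Counter(values)
    xs = list(range(min(values), max(values) + 1))  # horizontal scale
    height = max(counts.values())                   # tallest pile of dots
    rows = []
    for level in range(height, 0, -1):
        # Put a dot at x when at least `level` measurements equal x.
        row = " ".join("o" if counts[x] >= level else " " for x in xs)
        rows.append(row.rstrip())
    rows.append(" ".join(str(x) for x in xs))
    return rows

# Illustrative measurements; the value 5 repeats, so its dots pile up.
for row in dot_plot([5, 3, 8, 5, 4]):
    print(row)
```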
A stem-and-leaf display
• A stem-and-leaf display is a device for presenting quantitative
data in a graphical format, similar to a histogram, to assist in
visualizing the shape of a distribution. Stem-and-leaf displays
evolved from Arthur Bowley's work in the early 1900s and are
useful tools in exploratory data analysis.
• A stem-and-leaf display is often called a stemplot, but the
latter term often refers to another chart type.
Stem-and-leaf display
Data: 44 46 47 49 63 64 66 68 68 72 72 75 76 81 84 88 106
 4 | 4 6 7 9
 5 |
 6 | 3 4 6 8 8
 7 | 2 2 5 6
 8 | 1 4 8
 9 |
10 | 6
Key: 6 | 3 = 63
Leaf unit = 1.0
Stem unit = 10.0
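As a sketch, the display can be rebuilt from the 17 data values with a short Python function. The function name `stem_and_leaf` and its `stem_unit` parameter are hypothetical; the code assumes non-negative integers and a leaf unit of 1.

```python
from collections import defaultdict

data = [44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106]

def stem_and_leaf(values, stem_unit=10):
    """Return one display line per stem, including stems with no leaves."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // stem_unit].append(v % stem_unit)  # stem, then leaf
    stems = range(min(leaves), max(leaves) + 1)
    return [
        f"{s:>2} | {' '.join(str(leaf) for leaf in leaves[s])}".rstrip()
        for s in stems
    ]

for line in stem_and_leaf(data):
    print(line)
```

Empty stems (5 and 9 here) are kept on purpose, because dropping them would hide the gaps in the distribution.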
Histograms
• A histogram is a graphical representation of the distribution
of data. It is an estimate of the probability distribution of a
continuous variable and was first introduced by Karl
Pearson.
• A histogram is a representation of tabulated frequencies,
shown as adjacent rectangles, erected over discrete
intervals (bins), with an area equal to the frequency of the
observations in the interval.
• The height of a rectangle is also equal to the frequency
density of the interval, i.e., the frequency divided by the
width of the interval.
• The total area of the histogram is equal to the number of
observations.
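The relationship between frequency, frequency density, and area can be checked numerically. The sketch below reuses the 17 values from the stem-and-leaf example; the bin edges (width 10, from 40 to 110) and the helper name `histogram_rows` are assumptions chosen for illustration.

```python
data = [44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106]

# Assumed bins: [40, 50), [50, 60), ..., [100, 110).
bin_edges = list(range(40, 111, 10))

def histogram_rows(values, edges):
    """Return (low, high, frequency, frequency density) for each bin."""
    rows = []
    for low, high in zip(edges[:-1], edges[1:]):
        frequency = sum(low <= v < high for v in values)
        density = frequency / (high - low)  # height of the rectangle
        rows.append((low, high, frequency, density))
    return rows

rows = histogram_rows(data, bin_edges)
for low, high, frequency, density in rows:
    print(f"[{low}, {high}): frequency={frequency}, density={density:.1f}")

# Area of each rectangle = density * width = frequency,
# so the total area equals the number of observations.
total_area = sum(density * (high - low) for low, high, _, density in rows)
```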
Histograms
• First described by Karl Pearson.
• Purpose: to roughly assess the probability distribution
of a given variable by depicting the frequencies of
observations occurring in certain ranges of values.
SUMMATION NOTATION
• We denote the measurements of a quantitative
data set as follows:
• $x_1, x_2, x_3, \ldots, x_n$, where $x_1$ is the first measurement in
the data set, $x_2$ is the second measurement in the
data set, $x_3$ is the third measurement in the data
set, ..., and $x_n$ is the nth (and last) measurement in
the data set. Thus, if we have five measurements
in a set of data, we will write $x_1, x_2, x_3, x_4, x_5$ to
represent the measurements.
• If the actual numbers are 5, 3, 8, 5, and 4, we
have $x_1 = 5$, $x_2 = 3$, $x_3 = 8$, $x_4 = 5$, and $x_5 = 4$.
• Most of the formulas we use require a
summation of numbers. For example, one
sum we'll need to obtain is the sum of all the
measurements in the data set, or
$x_1 + x_2 + x_3 + \cdots + x_n$.
• To shorten the notation, we use the symbol $\Sigma$
for the summation.
• That is,
$$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \cdots + x_n$$
Verbally, translate $\sum_{i=1}^{n} x_i$ as follows: "The sum of the
measurements, whose typical member is $x_i$, beginning with the
member $x_1$ and ending with the member $x_n$."
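For the five measurements 5, 3, 8, 5, and 4, the summation can be sketched in Python as follows. The explicit loop mirrors the index i running from 1 to n; the built-in `sum` gives the same result.

```python
x = [5, 3, 8, 5, 4]   # x_1, ..., x_5
n = len(x)

total = 0
for i in range(1, n + 1):
    total += x[i - 1]  # Python lists are 0-indexed, so x_i is x[i - 1]

print(total)           # 25
print(sum(x))          # 25, the same sum using the built-in
```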
NUMERICAL MEASURES OF CENTRAL TENDENCY

• Most numerical descriptive methods measure one of two data characteristics:
1. The central tendency of the set of measurements, that is, the tendency of the
data to cluster, or center, about certain numerical values
2. The variability of the set of measurements, that is, the spread of the data
• In this section we concentrate on measures of central tendency. In the
next section, we discuss measures of variability.
• The most popular and best-understood measure of central tendency for a
quantitative data set is the arithmetic mean (or simply the mean) of a data
set.

Mean
• The mean of a set of quantitative data is the sum of
the measurements divided by the number of
measurements contained in the data set.

Calculate the mean of the following five sample
measurements: 5, 3, 8, 5, 6.
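Applying the definition, the sum of the five measurements is 5 + 3 + 8 + 5 + 6 = 27, so the mean is 27 / 5 = 5.4. A minimal Python check of that arithmetic:

```python
import statistics

x = [5, 3, 8, 5, 6]

# Mean = sum of the measurements / number of measurements.
x_bar = sum(x) / len(x)
print(x_bar)                 # 5.4

# The standard library computes the same quantity.
print(statistics.mean(x))    # 5.4
```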
Sample mean and population mean
• The sample mean $\bar{x}$ will play an important role in
accomplishing our objective of making inferences about
populations based on sample information.
• For this reason we need to use a different symbol for the
mean of a population, that is, the mean of the set of
measurements on every unit in the population.
• We use the Greek letter μ (mu) for the population mean.
Median
• The median of a quantitative data set is the middle number
when the measurements are arranged in ascending (or
descending) order.
If the data set is characterized by a
relative frequency histogram, the
median is the point on the x-axis
such that half the area under the
histogram lies above the
median and half lies below.
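A short sketch of the median calculation follows. For an even number of measurements there is no single middle number; the code uses the usual convention of averaging the two middle values, which is an assumption not spelled out in the text above. The four-value data set is illustrative.

```python
def median(values):
    """Middle value of the ordered data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([5, 3, 8, 5, 6]))  # ordered: 3 5 5 6 8 -> 5
print(median([5, 3, 8, 6]))     # ordered: 3 5 6 8   -> 5.5
```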
Mode
• The mode is the measurement that occurs
most frequently in the data set.
• Each of 10 taste testers rated a new brand of
barbecue sauce on a 10 point scale, where 1 =
awful and 10 = excellent. Find the mode for
the 10 ratings shown below.
8 7 9 6 8 10 9 9 5 7
• Solution: Since 9 occurs most often, the mode of
the 10 taste ratings is 9.
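The count behind this answer can be reproduced with a frequency tally. This is a minimal sketch using Python's `collections.Counter`:

```python
from collections import Counter

ratings = [8, 7, 9, 6, 8, 10, 9, 9, 5, 7]

# Tally how often each rating occurs, then take the most frequent one.
counts = Counter(ratings)
mode, frequency = counts.most_common(1)[0]

print(mode, frequency)  # 9 3
```

Ratings 7 and 8 each occur twice, so 9 (three occurrences) is the only mode here.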
NUMERICAL MEASURES OF VARIABILITY

• Measures of central tendency provide only a
partial description of a quantitative data set.
• The description is incomplete without a
measure of the variability, or spread, of the
data set.
• Knowledge of the data's variability along with
its center can help us visualize the shape of a
data set as well as its extreme values.
Range
• The simplest measure of the variability of a
quantitative data set is its range.
• The range of a quantitative data set is equal to
the largest measurement minus the smallest
measurement.
• The range is easy to compute and easy to
understand, but it is a rather insensitive measure
of data variation when the data sets are large.
• This is because two data sets can have the same
range and be vastly different with respect to data
variation.
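The point can be illustrated with two hypothetical data sets (made up for this sketch) that share the same range but differ in how tightly the values cluster around the center.

```python
import statistics

def data_range(values):
    """Largest measurement minus smallest measurement."""
    return max(values) - min(values)

# Hypothetical data: same extremes, very different spread in between.
clustered = [10, 48, 49, 50, 51, 52, 90]
spread_out = [10, 20, 35, 50, 65, 80, 90]

print(data_range(clustered), data_range(spread_out))  # 80 80

# The sample standard deviation does distinguish the two sets.
print(f"{statistics.stdev(clustered):.1f}")   # 23.1
print(f"{statistics.stdev(spread_out):.1f}")  # 30.1
```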
Variance
• Variance measures how far a set of numbers is
spread out. (A variance of zero indicates that
all the values are identical.)
• Variance is always non-negative: A small
variance indicates that the data points tend to
be very close to the mean (expected value)
and hence to each other, while a high variance
indicates that the data points are very spread
out from the mean and from each other.
Sample variance
• The sample variance for a sample of n measurements is equal to
the sum of the squared deviations from the mean divided by (n - 1).
In symbols, using $s^2$ to represent the sample variance,
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
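As a worked sketch, take the five measurements 5, 3, 8, 5, 6 with mean 5.4. The squared deviations are 0.16, 5.76, 6.76, 0.16, and 0.36, which sum to 13.2, so the sample variance is 13.2 / 4 = 3.3. The same steps in Python:

```python
import statistics

x = [5, 3, 8, 5, 6]
n = len(x)
x_bar = sum(x) / n

# Sum of squared deviations from the mean, divided by (n - 1).
squared_deviations = [(xi - x_bar) ** 2 for xi in x]
sample_variance = sum(squared_deviations) / (n - 1)

print(f"{sample_variance:.2f}")         # 3.30
print(f"{statistics.variance(x):.2f}")  # 3.30, library equivalent
```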
Standard deviation
• The standard deviation (SD) (represented by
the Greek letter sigma, σ) shows how much
variation or dispersion from the average
exists.
• A low standard deviation indicates that the
data points tend to be very close to the mean
(also called expected value); a high standard
deviation indicates that the data points are
spread out over a large range of values.
Sample standard deviation
• The sample standard deviation, s, is defined as the
positive square root of the sample variance, $s^2$.
Thus,
$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
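A brief sketch for the measurements 5, 3, 8, 5, 6, whose sample variance is 3.3; the standard deviation is the positive square root of that value.

```python
import math
import statistics

x = [5, 3, 8, 5, 6]

sample_variance = statistics.variance(x)       # divides by n - 1
sample_sd = math.sqrt(sample_variance)         # positive square root

print(f"{sample_sd:.3f}")                      # 1.817
print(f"{statistics.stdev(x):.3f}")            # 1.817, library equivalent
```

Unlike the variance, the standard deviation is expressed in the same units as the original measurements.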

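Symbols for variance and standard deviation: sample and population
• $s^2$ = sample variance
• $s$ = sample standard deviation
• $\sigma^2$ = population variance
• $\sigma$ = population standard deviation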
INTERPRETING THE STANDARD DEVIATION

• We've seen that if we are comparing the variability of two samples
selected from a population, the sample with the larger standard
deviation is the more variable of the two.
• Thus, we know how to interpret the standard deviation on a relative
or comparative basis, but we haven't explained how it provides a
measure of variability for a single sample.
• To understand how the standard deviation provides a measure of
variability of a data set, consider a specific data set and answer the
following questions: How many measurements are within 1
standard deviation of the mean?
• How many measurements are within 2 standard deviations? For a
specific data set, we can answer these questions by counting the
number of measurements in each of the intervals.
• However, if we are interested in obtaining a general answer to
these questions, the problem is more difficult.
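For a specific data set, the counting can be sketched as follows. The data are the five sample measurements 5, 3, 8, 5, 6 used earlier in the lecture; the helper name `count_within` is hypothetical.

```python
import statistics

x = [5, 3, 8, 5, 6]
x_bar = statistics.mean(x)   # 5.4
s = statistics.stdev(x)      # about 1.817

def count_within(values, center, spread, k):
    """Number of values in the interval [center - k*spread, center + k*spread]."""
    low, high = center - k * spread, center + k * spread
    return sum(low <= v <= high for v in values)

within_1 = count_within(x, x_bar, s, 1)  # interval about (3.58, 7.22)
within_2 = count_within(x, x_bar, s, 2)  # interval about (1.77, 9.03)

print(within_1, within_2)  # 3 5
```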
Interpreting the Standard Deviation:
Chebyshev's Rule
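The slide body for this rule did not survive in the text. As the rule is commonly stated, for any data set and any k > 1, at least 1 - 1/k² of the measurements lie within k standard deviations of the mean. A small sketch of that bound, with a hypothetical function name:

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of measurements within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's rule is informative only for k > 1")
    return 1 - 1 / k**2

print(chebyshev_lower_bound(2))           # 0.75
print(f"{chebyshev_lower_bound(3):.3f}")  # 0.889
```

The bound holds regardless of the shape of the distribution, which is why it is often conservative.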
Interpreting the Standard Deviation:
The Empirical Rule
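The slide body for this rule is also missing from the text. As commonly stated, the empirical rule is a rule of thumb for mound-shaped, symmetric distributions: approximately 68% of the measurements fall within 1 standard deviation of the mean, approximately 95% within 2, and approximately 99.7% within 3. The sketch below compares those figures with simulated data; the simulated sample, its parameters, and the helper name are assumptions for illustration only.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Simulated mound-shaped data (illustrative only, not from the lecture).
sample = [random.gauss(50, 10) for _ in range(10_000)]
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)

def proportion_within(values, center, spread, k):
    """Fraction of values within k*spread of center."""
    inside = sum(center - k * spread <= v <= center + k * spread for v in values)
    return inside / len(values)

for k, rule_of_thumb in [(1, 0.68), (2, 0.95), (3, 0.997)]:
    observed = proportion_within(sample, x_bar, s, k)
    print(f"k={k}: observed {observed:.3f}, rule of thumb {rule_of_thumb}")
```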
NUMERICAL MEASURES OF RELATIVE STANDING

• We've seen that numerical measures of central
tendency and variability describe the general
nature of a quantitative data set (either a sample
or a population).
• In addition, we may be interested in describing
the relative quantitative location of a particular
measurement within a data set.
• Descriptive measures of the relationship of a
measurement to the rest of the data are called
measures of relative standing.
Percentile ranking
• One measure of the relative standing of a
measurement is its percentile ranking.
Z-score
• Another measure of relative standing in popular
use is the z-score.
The sample z-score for a measurement x is
$$z = \frac{x - \bar{x}}{s}$$
The population z-score for a measurement x is
$$z = \frac{x - \mu}{\sigma}$$
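A sample z-score is the distance between a measurement and the sample mean, expressed in units of the sample standard deviation. A minimal sketch using the five measurements 5, 3, 8, 5, 6 (mean 5.4, standard deviation about 1.817); the function name is hypothetical.

```python
import statistics

x = [5, 3, 8, 5, 6]
x_bar = statistics.mean(x)
s = statistics.stdev(x)

def sample_z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

for value in x:
    print(f"x = {value}: z = {sample_z_score(value, x_bar, s):+.2f}")
```

A positive z-score means the measurement lies above the mean, and a negative one means it lies below.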


Interpretation of z-score
