
3.1 DESCRIBING VARIATION
3.1.1 The Stem-and-Leaf Plot
3.1.2 The Histogram
3.1.3 Numerical Summary of Data
3.1.4 The Box Plot
3.1.5 Probability Distributions
3.1.1 The Stem-and-Leaf Plot
 Suppose that the data are represented by x1, x2, . . . , xn
and that each number xi consists of at least two digits.
To construct a stem-and-leaf plot, we divide each
number xi into two parts: a stem, consisting of one or
more of the leading digits; and a leaf, consisting of the
remaining digits.

For example, if the data consist of percent defective
information between 0 and 100 on lots of semiconductor
wafers, then we can divide the value 76 into the stem 7
and the leaf 6.
See Example 3.1, Health Insurance Claims.
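
The stem/leaf split described above can be sketched in a few lines of Python. This is only an illustration: the function name `stem_and_leaf` and the data values are invented here, and the sketch assumes non-negative integers with the tens digit (and above) as the stem and the units digit as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative integers by stem (leading digits) and leaf (units digit).

    Leaves are sorted within each stem, and stems are returned in ascending order.
    """
    plot = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)  # e.g. 76 -> stem 7, leaf 6
        plot[stem].append(leaf)
    return {stem: sorted(leaves) for stem, leaves in sorted(plot.items())}

# Hypothetical percent-defective values, invented for illustration only.
data = [76, 71, 83, 68, 76, 90, 65, 88]
for stem, leaves in stem_and_leaf(data).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Each printed row shows one stem followed by its leaves, so the row lengths give a rough picture of the shape of the data.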
 The version of the stem-and-leaf plot produced by Minitab is
sometimes called an ordered stem-and-leaf plot, because the
leaves are arranged by magnitude. This version of the display makes it
very easy to find percentiles of the data. Generally, the 100kth
percentile is a value such that at least 100k% of the data values are at
or below this value and at least 100(1 − k)% of the data values are at
or above this value.
 The fiftieth percentile of the data distribution is called the sample
median.
 If n, the number of observations, is odd, first sort the observations in
ascending order (that is, rank the data from smallest observation to
largest observation). The median is then the observation in rank
position [(n − 1)/2 + 1] on this list.
 If n is even, the median is the average of the (n/2)th and (n/2 + 1)th
ranked observations.
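
The two median rules can be checked with a short Python sketch. The function name `sample_median` and the data are illustrative assumptions. Note that the rank positions in the text are 1-based, while Python list indexes start at 0.

```python
def sample_median(values):
    """Median by the rank rules above (ranks are 1-based, indexes 0-based)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Rank (n - 1)/2 + 1 corresponds to index (n - 1)//2.
        return ordered[(n - 1) // 2]
    # Ranks n/2 and n/2 + 1 correspond to indexes n/2 - 1 and n/2.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# Hypothetical data, invented for illustration.
print(sample_median([7, 3, 9, 5, 1]))  # odd n: middle of 1, 3, 5, 7, 9 -> 5
print(sample_median([7, 3, 9, 5]))     # even n: average of 5 and 7 -> 6.0
```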
3.1.2 The Histogram
 A histogram is a more compact summary of data than
a stem-and-leaf plot.
 To construct a histogram for continuous data, we must
divide the range of the data into intervals, which are
usually called class intervals, cells, or bins.
* The bins should be of equal width to enhance the
visual information in the histogram.
 To construct the histogram, use the horizontal axis to
represent the measurement scale for the data and the
vertical scale to represent the counts, or frequencies.
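
As a rough sketch of the binning step, the following Python function splits the range of the data into equal-width bins and counts the observations in each. The function name, the choice of four bins, and the data are assumptions made for illustration. The sketch also assumes the data are not all identical, so the bin width is positive.

```python
def histogram_counts(values, num_bins):
    """Return (bin edges, counts) for equal-width bins spanning the data range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins  # assumes hi > lo
    counts = [0] * num_bins
    for v in values:
        # The largest value falls on the last edge, so clamp it into the last bin.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts

# Hypothetical measurements, invented for illustration.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
edges, counts = histogram_counts(data, 4)
for left, right, count in zip(edges, edges[1:], counts):
    # Horizontal axis: measurement scale; bar length: frequency.
    print(f"[{left:4.1f}, {right:4.1f})  {'#' * count}")
```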
3.1.3 Numerical Summary of Data
 The stem-and-leaf plot and the histogram
provide a visual display of three
properties of sample data:
 the shape of the distribution of the data,
 the central tendency in the data, and
 the scatter or variability in the data.
It is also helpful to use numerical
measures of central tendency and scatter.
 Suppose that x1, x2, . . . , xn are the
observations in a sample. The most
important measure of central tendency in
the sample is the sample average,

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

 Note that the sample average is simply
the arithmetic mean of the n
observations.
 The variability in the sample data is
measured by the sample variance,

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

 Note that the sample variance is simply
the sum of the squared deviations of each
observation from the sample average (written x̄ in the formula),
divided by the sample size minus one.
 The units of the sample variance s² are the
square of the original units of the data. This is
often inconvenient and awkward to interpret,
and so we usually prefer to use the square root
of s², called the sample standard deviation s,
as a measure of variability.
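
A minimal Python sketch of these three summaries follows. The data values and the function name `sample_variance` are hypothetical. The results are compared against the standard library's `statistics.variance` and `statistics.stdev`, which also divide by n − 1.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the sample average, divided by n - 1."""
    n = len(values)
    average = sum(values) / n
    return sum((x - average) ** 2 for x in values) / (n - 1)

# Hypothetical observations, invented for illustration.
data = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]

average = sum(data) / len(data)  # 30 / 6 = 5.0
s2 = sample_variance(data)       # 24 / 5 = 4.8, in squared units
s = math.sqrt(s2)                # back in the original units

print(average, s2, s)
print(statistics.variance(data), statistics.stdev(data))  # library cross-check
```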
 Consider the two samples shown here:

 This leads to an important point: The standard deviation
does not reflect the magnitude of the sample data,
only the scatter about the average.
 A more efficient formula is

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1}$$
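
Both points can be illustrated with a small Python sketch. The computational form of the variance needs only the sum of the observations and the sum of their squares. Shifting every observation by a constant changes the average but leaves the standard deviation unchanged. The function name and the two samples below are illustrative choices, not necessarily the samples the text refers to.

```python
import math
import statistics

def variance_shortcut(values):
    """Sample variance from the sum and the sum of squares only."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Illustrative samples: the second is the first shifted up by 100.
small = [1, 3, 5]
large = [x + 100 for x in small]

for sample in (small, large):
    s = math.sqrt(variance_shortcut(sample))
    # The averages differ (3 versus 103) but the standard deviation is 2.0 for both.
    print(sample, statistics.mean(sample), s)
```

In floating-point arithmetic the shortcut form can lose precision when the average is very large relative to the scatter, so the deviation form is often the safer default in code.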
3.1.4 The Box Plot
 The box plot is a graphical display that simultaneously displays
several important features of the data, such as location or central
tendency, spread or variability, departure from symmetry, and
identification of observations that lie unusually far from the bulk of
the data (these observations are often called “outliers”).
 A box plot displays the three quartiles, the minimum, and the
maximum of the data on a rectangular box, aligned either
horizontally or vertically.
 A line at either end extends to the extreme values. These lines are
usually called whiskers.
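
The five numbers that such a box plot displays can be computed as in the sketch below. Several conventions exist for sample quartiles. This example assumes the `"inclusive"` interpolation method of Python's `statistics.quantiles`, which may differ slightly from the convention used by a particular textbook or software package. The function name and the data are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Minimum, three quartiles, and maximum (quartile method is an assumption)."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),  # end of the lower whisker
        "q1": q1,            # lower edge of the box
        "median": q2,        # line inside the box
        "q3": q3,            # upper edge of the box
        "max": max(values),  # end of the upper whisker
    }

# Hypothetical data, invented for illustration.
summary = five_number_summary([9, 1, 8, 2, 7, 3, 6, 4, 5])
print(summary)
```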
 modified box plot
3.1.5 Probability Distributions
 A sample is a collection of
measurements selected from some larger
source or population.
A probability distribution is a
mathematical model that relates the value
of the variable with the probability of
occurrence of that value in the
population.
 random variable
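 There are two types of probability distributions: continuous
distributions, in which the variable is measured on a continuous scale,
and discrete distributions, in which the variable can take only certain
values, such as the integers 0, 1, 2, . . . .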
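 The mean μ of a probability distribution is a measure of the
central tendency in the distribution, or its location. The mean is
defined as

$$\mu = \begin{cases} \displaystyle\int_{-\infty}^{\infty} x\, f(x)\, dx, & x \text{ continuous} \\[2ex] \displaystyle\sum_{i} x_i\, p(x_i), & x \text{ discrete} \end{cases}$$

 The scatter, spread, or variability in a distribution is expressed by
the variance. The definition of the variance is

$$\sigma^2 = \begin{cases} \displaystyle\int_{-\infty}^{\infty} (x - \mu)^2 f(x)\, dx, & x \text{ continuous} \\[2ex] \displaystyle\sum_{i} (x_i - \mu)^2 p(x_i), & x \text{ discrete} \end{cases}$$

 The standard deviation σ is the square root of the variance.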
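
For a discrete distribution, the mean is the probability-weighted sum of the values, and the variance is the probability-weighted sum of squared deviations from the mean. A minimal Python sketch under those standard definitions follows. The function names and the probability table are invented for illustration.

```python
import math

def distribution_mean(pmf):
    """Mean of a discrete distribution given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

def distribution_variance(pmf):
    """Variance: probability-weighted squared deviations from the mean."""
    mu = distribution_mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

# Hypothetical distribution of a count taking the values 0, 1, and 2.
pmf = {0: 0.5, 1: 0.3, 2: 0.2}

mu = distribution_mean(pmf)          # 0(0.5) + 1(0.3) + 2(0.2) = 0.7
sigma2 = distribution_variance(pmf)  # approximately 0.61
sigma = math.sqrt(sigma2)            # standard deviation
print(mu, sigma2, sigma)
```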


3.2 IMPORTANT DISCRETE DISTRIBUTIONS
3.2.1 The Hypergeometric Distribution
3.2.2 The Binomial Distribution
3.2.1 The Hypergeometric Distribution
 Suppose that there is a finite population
consisting of N items. Some number, say
D (D ≤ N), of these items fall into a class of
interest. A random sample of n items is
selected from the population without
replacement, and the number of items in the
sample that fall into the class of interest,
say x, is observed. Then x is a
hypergeometric random variable with the
probability distribution defined as follows:

$$p(x) = \frac{\binom{D}{x}\binom{N-D}{n-x}}{\binom{N}{n}}, \qquad x = 0, 1, 2, \ldots, \min(n, D)$$
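
The hypergeometric probabilities can be evaluated directly with binomial coefficients, as in this sketch. Here D denotes the number of population items in the class of interest. The function name and the numerical example are assumptions made for illustration.

```python
from math import comb

def hypergeom_pmf(x, N, D, n):
    """P(x items of interest in a sample of n drawn without replacement).

    N: population size, D: items of interest in the population, n: sample size.
    """
    # Values of x outside the feasible range have probability zero.
    if x < max(0, n - (N - D)) or x > min(n, D):
        return 0.0
    return comb(D, x) * comb(N - D, n - x) / comb(N, n)

# Hypothetical lot: 5 items, 2 of them in the class of interest, sample of 2.
for x in range(3):
    print(x, hypergeom_pmf(x, N=5, D=2, n=2))  # 3/10, 6/10, 1/10
```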
3.2.2 The Binomial Distribution
 Consider a process that consists of a sequence of n independent trials. By
independent trials, we mean that the outcome of each trial does not depend in any
way on the outcome of previous trials. When the outcome of each trial is either a
“success” or a “failure,” the trials are called Bernoulli trials. If the probability of
“success” on any trial, say p, is constant, then the number of “successes” x in n
Bernoulli trials has the binomial distribution with parameters n and p, defined as
follows:

$$p(x) = \binom{n}{x} p^x (1 - p)^{n - x}, \qquad x = 0, 1, \ldots, n$$
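
A short Python sketch of the binomial probabilities, using the standard form C(n, x) p^x (1 − p)^(n − x), is given below. The function name and the parameter values are illustrative assumptions.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(exactly x successes in n independent Bernoulli trials with success probability p)."""
    if x < 0 or x > n:
        return 0.0
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Hypothetical case: 3 trials, success probability 0.5 on each.
for x in range(4):
    print(x, binomial_pmf(x, n=3, p=0.5))  # 0.125, 0.375, 0.375, 0.125
```

Summing the function over x = 0, 1, . . . , n should give 1, which is a quick sanity check on any parameter choice.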
