
Copyright 2011 Pearson Education, Inc.

Describing Numerical
Data
Chapter 4
4.1 Summaries of Numerical Variables
Can 500 different songs fit on the iPod
Shuffle?

To answer this question we must understand the
typical length of a song and the variation of song
sizes around the typical length

We can do this using summary statistics

4.1 Summaries of Numerical Variables
A Subset of the Data


4.1 Summaries of Numerical Variables
The Median

Value in the middle of a sorted list of numerical
values (a typical value)

Half of the values fall below the median; half fall
above

It is the 50th percentile


4.1 Summaries of Numerical Variables
Common Percentiles

Lower Quartile = 25th Percentile

Upper Quartile = 75th Percentile

One quarter of the values fall below the lower
quartile and one quarter fall above the upper
quartile




4.1 Summaries of Numerical Variables
The Interquartile Range (IQR)
IQR = 75th Percentile - 25th Percentile

A measure of variation based on quartiles

Used to accompany the median





4.1 Summaries of Numerical Variables
The Range
Range = Maximum - Minimum

Maximum Value = 100th Percentile

Minimum Value = 0th Percentile

Another measure of variation; not preferred
because it is based only on the two extreme values




4.1 Summaries of Numerical Variables
The Five Number Summary

Minimum
Lower Quartile
Median
Upper Quartile
Maximum
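
As a rough illustration of how these five values can be computed, here is a minimal Python sketch. The function name `five_number_summary` and the sample data are hypothetical, and the "inclusive" quartile method is only one of several conventions; other software may report slightly different quartiles.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    ordered = sorted(values)
    # statistics.quantiles with n=4 returns the three quartile cut points.
    # The "inclusive" method is an assumption; conventions differ.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

# Hypothetical song sizes in MB (not the data set used in the slides).
sizes_mb = [2.1, 2.9, 3.3, 3.5, 3.6, 4.0, 4.4, 5.2, 9.8]
summary = five_number_summary(sizes_mb)
print(summary)
```

The middle value returned is the median, so it agrees with `statistics.median` for the same data.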





4.1 Summaries of Numerical Variables
The Five Number Summary for Song Sizes

Minimum = 0.148 MB
Lower Quartile = 2.85 MB
Median = 3.5015 MB
Upper Quartile = 4.32 MB
Maximum = 21.622 MB





4.1 Summaries of Numerical Variables
Summary Statistics for Song Sizes

Median = 3.5015 MB

IQR = 4.32 MB - 2.85 MB = 1.47 MB

Range = 21.622 MB - 0.148 MB = 21.474 MB
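
The two subtractions can be checked directly from the five-number summary for song sizes. This is a small sketch; the variable names are illustrative only.

```python
# Five-number summary for song sizes (MB), as reported in the slides.
minimum, lower_q, median, upper_q, maximum = 0.148, 2.85, 3.5015, 4.32, 21.622

iqr = upper_q - lower_q          # spread of the middle half of the songs
value_range = maximum - minimum  # spread of all songs, set by the two extremes

print(f"IQR   = {iqr:.2f} MB")
print(f"Range = {value_range:.3f} MB")
```

The range is roughly fifteen times the IQR here, which shows how strongly a few extreme songs affect it.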




4.1 Summaries of Numerical Variables
The Mean (Average)

Arithmetic average; divide the sum of the values
by the number of values (another typical value)

The symbol y represents the variable of interest

The symbol $\bar{y}$ (read "y bar") represents the mean
4.1 Summaries of Numerical Variables
The Mean (Average)




$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$
4.1 Summaries of Numerical Variables
The Variance ($s^2$)

Is a measure of variation based on the
mean

How far a value is from the mean is known
as its deviation; the variance is the average
of the squared deviations





4.1 Summaries of Numerical Variables
The Variance




$$s^2 = \frac{(y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2}{n - 1}$$

4.1 Summaries of Numerical Variables


The Standard Deviation (SD)

Is the square root of the variance


Is a measure of variability in the original
units of the data (the variance results in
squared units)






$$s = \sqrt{s^2}$$
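
To make the steps from mean to variance to standard deviation concrete, the sketch below computes each by hand and compares the results with Python's `statistics` module. It assumes the sample form of the variance (divisor n - 1); the function name `sample_variance` and the data are hypothetical.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    y_bar = sum(values) / n                       # the mean
    squared_devs = [(y - y_bar) ** 2 for y in values]
    return sum(squared_devs) / (n - 1)            # assumed n - 1 divisor

values = [2.0, 4.0, 4.0, 5.0, 5.0]                # hypothetical data
s2 = sample_variance(values)                      # in squared units
s = math.sqrt(s2)                                 # back in the original units

print(s2, statistics.variance(values))
print(s, statistics.stdev(values))
```

Taking the square root is what returns the measure of spread to the units of the data.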
4.1 Summaries of Numerical Variables
Summary Statistics for Song Sizes

Mean = 3.7794 MB

Variance = 2.584 MB²

SD = 1.607 MB






4M Example 4.1: MAKING M&Ms
Motivation

How many M&Ms are needed to fill a bag
labeled to weigh 1.6 ounces?


4M Example 4.1: MAKING M&Ms
Method

Data are weights of 72 plain chocolate M&Ms taken
from several packages. To get a measure of the
amount of variation relative to the typical size, we
use the ratio of the standard deviation to the
mean (known as the coefficient of variation).

$$c_v = \frac{s}{\bar{y}}$$

4M Example 4.1: MAKING M&Ms


Mechanics

Mean Weight = 0.86 gm
SD = 0.04 gm

$c_v$ = 0.04 gm / 0.86 gm ≈ 0.0465






4M Example 4.1: MAKING M&Ms
Message

Since the SD is quite small compared to the mean
(with a $c_v$ of about 5%), the results suggest that 53
pieces are usually enough to fill a bag.

A bag labeled 1.6 ounces weighs about 45.36 grams.
Since there is little variability around the typical weight of
an M&M, we can calculate the number of pieces to fill a
1.6 ounce bag as 45.36/0.86.
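
The arithmetic in this example can be reproduced in a few lines. This is a sketch using the figures from the slides; the variable names are illustrative, and rounding up to a whole piece is an assumption about how the count of 53 is obtained.

```python
import math

mean_weight_g = 0.86   # mean weight of one M&M, from the slides
sd_weight_g = 0.04     # standard deviation of the weights, from the slides
bag_weight_g = 45.36   # 1.6 ounces expressed in grams, from the slides

cv = sd_weight_g / mean_weight_g                    # coefficient of variation
pieces = math.ceil(bag_weight_g / mean_weight_g)    # round up to a whole piece

print(f"c_v = {cv:.4f}")
print(f"pieces needed = {pieces}")
```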

4.2 Histograms and the
Distribution of Numerical Data
Histograms

Plot the distribution of a numerical variable by
showing counts of values occurring within
adjacent intervals

Similar to bar charts but designed for continuous
quantitative data (bar charts are only appropriate
for discrete categories)
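
The counting behind a histogram can be sketched without any plotting library. The helper `histogram_counts`, its `start` and `width` parameters, and the data below are all hypothetical; each interval includes its left endpoint and excludes its right endpoint, which is one common convention.

```python
from collections import Counter

def histogram_counts(values, start, width):
    """Count values in adjacent intervals [start + k*width, start + (k+1)*width)."""
    counts = Counter(int((v - start) // width) for v in values)
    # Include empty intervals between the first and last occupied ones,
    # so that gaps in the data remain visible.
    return {
        (start + k * width, start + (k + 1) * width): counts[k]
        for k in range(min(counts), max(counts) + 1)
    }

# Hypothetical song sizes in MB (not the data set used in the slides).
sizes_mb = [2.1, 2.9, 3.3, 3.5, 3.6, 4.0, 4.4, 5.2, 9.8]
bins = histogram_counts(sizes_mb, start=0, width=1)
for (low, high), count in bins.items():
    print(f"{low:>2}-{high:<2} MB | {'*' * count}")
```

Changing `width` changes the picture, which is why the choice of interval width matters when preparing a histogram.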

4.2 Histograms and the
Distribution of Numerical Data
Histogram of Song Sizes


4.2 Histograms and the
Distribution of Numerical Data
Histogram of Song Sizes

Indicates a few very long songs (outliers)

The graph devotes more than half of its area to
show less than 1% of the songs (white space
rule: graphs with mostly white space can be
improved by changing the interval of the plot to
focus on the data rather than the white space)
4.3 Boxplot
Graph of the Five Number Summary


4.3 Boxplot
Combining Boxplots with Histograms

Boxplots locate the median and quartiles
and highlight outliers

The median splits the area of the histogram
in half (unlike the mean, it is resistant or
robust to the effects of outliers)
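
A small experiment illustrates what "resistant" means here: adding one very large value moves the mean noticeably but leaves the median where it was. The data are hypothetical.

```python
import statistics

sizes_mb = [3.0, 3.0, 4.0, 4.0, 5.0]   # hypothetical song sizes in MB
with_outlier = sizes_mb + [21.6]       # the same songs plus one very large file

mean_shift = statistics.mean(with_outlier) - statistics.mean(sizes_mb)
median_shift = statistics.median(with_outlier) - statistics.median(sizes_mb)

print(f"mean moves by   {mean_shift:.2f} MB")
print(f"median moves by {median_shift:.2f} MB")
```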




4.3 Boxplot
Boxplot with Histogram of Song Sizes



4.4 Shape of a Distribution
Modes

Position of an isolated peak in a histogram

A histogram with one peak is unimodal; two
is bimodal; three or more is multimodal

A histogram with all bars about the same
height is uniform



4.4 Shape of a Distribution
Symmetry and Skewness

A distribution is symmetric if the two sides
of its histogram are mirror images

A distribution is skewed if one tail of the
histogram stretches out farther than the
other



4.4 Shape of a Distribution
Distribution of Song Sizes

The mode lies between 3 and 4 MB

The distribution is right skewed (the right
tail stretches out farther than the left tail)



4M Example 4.2:
EXECUTIVE COMPENSATION
Motivation

What can we say about the salaries of CEOs
in 2003?



4M Example 4.2:
EXECUTIVE COMPENSATION
Method

Data consist of the salaries for 1,501 CEOs
reported in thousands of dollars (obtained
from Compustat).



4M Example 4.2:
EXECUTIVE COMPENSATION
Mechanics




4M Example 4.2:
EXECUTIVE COMPENSATION
Message

The distribution of annual salaries of CEOs
in 2003 is unimodal, nearly symmetric
around the median of $650,000, and right
skewed. The average is $697,000. The
largest salary is $4,000,000.


4.4 Shape of a Distribution
Bell-Shaped Distributions and Empirical Rule

A bell-shaped distribution is symmetric and
unimodal

The empirical rule uses the standard
deviation to describe how data with a bell-
shaped distribution cluster around the
mean



4.4 Shape of a Distribution
The Empirical Rule
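
The empirical rule is usually stated as follows: for bell-shaped data, about 68% of values lie within 1 SD of the mean, about 95% within 2 SDs, and about 99.7% within 3 SDs. The sketch below checks these shares on simulated bell-shaped data; the simulation, seed, and function name `share_within` are assumptions made for illustration, not part of the original slides.

```python
import random
import statistics

random.seed(4)  # fixed seed so the run is repeatable
data = [random.gauss(0.0, 1.0) for _ in range(100_000)]  # simulated bell-shaped data

y_bar = statistics.mean(data)
s = statistics.stdev(data)

def share_within(k):
    """Fraction of values within k standard deviations of the mean."""
    return sum(abs(y - y_bar) <= k * s for y in data) / len(data)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} SD: {share:.3f}")
```

For skewed data such as the song sizes, these shares need not hold, which is why the rule is tied to bell-shaped distributions.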


4.4 Shape of a Distribution
Standardizing

Converting data to z-scores

Z-scores measure the distance from the
mean in standard deviations




$$z = \frac{y - \bar{y}}{s}$$
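
As an illustration, the sketch below standardizes two song sizes using the mean (3.7794 MB) and SD (1.607 MB) reported earlier for the song data. The function name `z_score` is hypothetical.

```python
def z_score(y, y_bar, s):
    """Distance of y from the mean, measured in standard deviations."""
    return (y - y_bar) / s

mean_mb, sd_mb = 3.7794, 1.607   # summary statistics for song sizes, from the slides

z_max = z_score(21.622, mean_mb, sd_mb)     # the largest song
z_median = z_score(3.5015, mean_mb, sd_mb)  # the median song

print(f"largest song: z = {z_max:.1f}")
print(f"median song:  z = {z_median:.2f}")
```

The largest song lies roughly 11 SDs above the mean, far outside what a bell-shaped distribution would produce, while the median song sits slightly below the mean.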

4.5 Epilog
Can 500 different songs fit on the iPod
Shuffle?

Because of variation, not every collection of 500
songs will fit. The longest 500 songs won't fit.
However, based on the typical song size, the
amount of variation in song sizes and the shape
of its distribution, we can say that most
collections of 500 songs will fit!




Best Practices

Be sure that data are numerical when using
histograms and summaries such as the mean
and standard deviation.

Summarize the distribution of a numerical
variable with a graph.

Choose interval widths appropriate to the data
when preparing a histogram.


Best Practices (Continued)

Scale your plots to show data, not empty space.

Anticipate what you will see in a histogram.

Label clearly.

Check for gaps.


Pitfalls

Do not use the methods of this chapter for
categorical variables.

Do not assume that all numerical data have a
bell-shaped distribution.

Do not ignore the presence of outliers.



Pitfalls (Continued)

Do not remove outliers unless you have a good
reason.

Do not forget to take the square root of a
variance.

