Chapter 1
Examining Distributions
In this chapter, we discuss the basic tools for describing data graphically.
More complex graphing tools will be discussed in Chapter 2.
Not all graphs are equal. The basic idea is that the data you have and what
you plan to do with them determine which graph is appropriate. This is
analogous to, for example, a home handyman examining the head of a screw
to decide which screwdriver is needed to take it out. Survey data typically
contain several variables of interest to the researcher. For example, a realtor
interested in house prices in your neighbourhood may collect data on age,
type of dwelling (condo, townhouse, duplex, etc.), number of bedrooms, and
distance to the nearest shopping mall for recently sold homes in the
area. In what price range did most of the houses sell? And is there a
relationship between number of bedrooms and age?
The simplest graphs (histogram, bar graph, pie chart, box plot, and
stem-and-leaf display) organize data by examining just one variable at a time.
More complex graphs, such as the comparative bar graph and the scatterplot (to
be discussed in Chapter 2), describe the relationship, if any, between pairs of
variables. You will need to be familiar with the following terminology
relating to data analysis in general:
Describe the individuals in the study and determine whether the variable of
interest is categorical or quantitative.
Solution: The individuals are the 100 restaurants surveyed. The variable of
interest is the cost (in $) of a meal. The variable is quantitative.
2. A mortgage broker wishes to survey Vancouver households to determine
what percent of a household's income is spent on housing. Describe the
individuals in the study and determine whether the variable of interest is
categorical or quantitative.
Solution: The individuals are households participating in the survey. The
variable of interest is the percent of the household's income that is spent on
housing. It is quantitative.
3. A stats student wants to survey Langara students to determine if a
student's mode of commuting to classes (by car or motorcycle, by bus, by bike,
other) depends on the student's employment status (unemployed,
part-time).
Why is employment status a categorical variable?
Solution: Employment status is categorical because it places each student into
one of two categories.
Your Turn (for practice at home)
a) Determine what type is each of the following variables:
Age, nationality, weight, hourly wage, mother's occupation, program of
study (Career, UT), number of passengers on a bus.
b) Do Apply Your Knowledge Exercise #1.2 on page 5 and check with the
answer provided below.
Solution #1.2: The Who (or cases) is the set of students enrolled in the stat
class; the What (or variables) are the 8 variables ID, Exam1, Exam2, ...,
Grade, of which ID and Grade are categorical and the remaining 6 are quantitative; the
Why (or purpose of the data) is to help the instructor keep track of students'
work and to be able to identify students who are falling behind in the course.
c) Additional practice exercises: page 21, #s 1.13 & 1.15.
[Figure: pie chart of collisions by day of the week. Slice labels: 10.08%, 13.95%, 17.08%, 15.20%, 15.33%, 14.87%]
We will discuss how to graph data with Statgraphics later. The pie chart
shows that collisions occur almost evenly throughout the week; the fewest
collisions occur on Mondays and the most collisions occur on Saturdays.
Note that a pie chart is suitable for displaying information about individual
categories of a single variable relative to the whole distribution. However, a
bar graph (or Pareto chart) is more flexible than a pie chart. We can use a bar
graph to display categorical data even if the categories do not necessarily
represent a single variable. As an example from the text, in a survey of
adults who use several electronic devices or services, we may be interested
Weight    PercentOfTotal
31.7      12.5
13.6       5.3
20.8       8.2
83.0      32.7
30.7      12.1
19.4       7.6
14.2       5.6
32.6      12.8
 8.2       3.2
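As a rough check on how a PercentOfTotal column of this kind is obtained, the sketch below divides each weight by the sum of all the weights. The variable names are illustrative only. Because the listed weights are themselves rounded, the computed percentages agree with the table only to within about 0.1.

```python
# Weights as listed in the table above.
weights = [31.7, 13.6, 20.8, 83.0, 30.7, 19.4, 14.2, 32.6, 8.2]

total = sum(weights)

# Each category's share of the total, as a percentage rounded to one decimal.
percent_of_total = [round(100 * w / total, 1) for w in weights]

for w, p in zip(weights, percent_of_total):
    print(f"{w:6.1f} {p:6.1f}")
```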
Histogram
A bar graph or pie chart helps to display data quickly. But these graphs have limited
use in data analysis because it is easy to understand data for a single
categorical variable without a graph. When the variable is quantitative,
Example 3: Refer to the TBillRates50 data on page 16. Graph the data and
describe the main features.
Solution: We can make a histogram because the variable Rate is
quantitative. We will use 6 classes according to the $2^k$ rule (take the
smallest k for which $2^k$ is at least the number of observations).
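The $2^k$ rule can be sketched in a few lines. The function name is hypothetical, and the sample size of 50 is an assumption taken from the data set name TBillRates50.

```python
def classes_by_2k_rule(n):
    """Return the smallest k such that 2**k >= n (suggested number of classes)."""
    k = 0
    while 2 ** k < n:
        k += 1
    return k

# Assumed sample size: 50 observations, as the name TBillRates50 suggests.
print(classes_by_2k_rule(50))
```

With 50 observations, $2^5 = 32$ is too small and $2^6 = 64$ is large enough, so the rule suggests 6 classes.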
We see that although the rate varies from a low of 1% to a high of 13.8%, most of
the values fall approximately between 3% and 7%. The graph is not
symmetric but is more elongated in the direction of larger values
of the rate; it is right-skewed.
[Figure: three density curves, titled Symmetric Curve, Right-Skewed Curve, and Left-Skewed Curve]
Question: Refer to Example 1.15, page 15. What is the shape of the
histogram for Length of Service Call?
Stemplot
Another handy way of organizing quantitative data is a stemplot. A stemplot
is an arrangement of the values into groups (called stems). Values on the
same stem are arranged in order in a row. A stemplot is useful for visualizing
a small data set when the individual values must be preserved.
To make a stemplot:
1. Break up the digits of every value in the data into a stem consisting of
all but the rightmost digit, and a leaf, the last digit.
2. Write all the stem values in a vertical column, with the smallest at the
top and the largest at the bottom, and draw a vertical line at the right
of the column. The line is used to separate the stems and leaves.
3. Write each leaf on the right of the corresponding stem.
4. Arrange the leaves on each stem in numerical order.
Technical note: If the values do not all have the same number of digits,
consider rounding off, or padding the values that have fewer digits with
zeros.
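The four steps above can be sketched as a short function. This is only an illustration for non-negative whole numbers with one leaf digit; the function name and the sample values are made up, and it does not split stems into two rows the way some software does.

```python
from collections import defaultdict

def stemplot(values):
    """Return the rows of a stemplot for non-negative whole numbers."""
    leaves = defaultdict(list)
    # Steps 1 and 4: split each value into stem and leaf, in sorted order.
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    rows = []
    # Steps 2 and 3: one row per stem, smallest first, leaves after the bar.
    for stem in range(min(leaves), max(leaves) + 1):
        row = f"{stem}|" + "".join(str(d) for d in leaves.get(stem, []))
        rows.append(row)
    return rows

# Made-up values, only to show the layout.
for row in stemplot([12, 7, 25, 31, 12, 28]):
    print(row)
```

Stems with no leaves are still printed, so that gaps in the data stay visible.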
Example 4. A financial analyst is interested in determining whether meal
costs at city restaurants differ from meal costs at suburban restaurants. She
collected data from a sample of 50 restaurants in each area.
Meal Costs at City Restaurants
61 50 35 37 29 74 26 45 54 34
43 56 32 41 33 32 67 25 51 27
44 57 74 50 77 44 66 43 76 50
50 80 39 53 61 42 68 55 44 60
44 42 65 77 33 36 48 35 57 43
Meal Costs at Suburban Restaurants
28 46 70 47 47 29 33 39 39 35
34 54 41 59 51 41 60 44 34 50
52 51 51 71 67 37 56 60 68 36
26 37 49 43 34 27 48 52 34 34
51 34 44 48 31 38 40 39 44
  4   2|5679
  9   3|22334
 14   3|55679
 24   4|1223334444
 (2)  4|58
 24   5|0000134
 17   5|5677
 13   6|011
 10   6|5678
  6   7|44
  4   7|677
  1   8|0
The stemplot is like a histogram of ungrouped data that has been rotated 90
degrees clockwise, the stems representing the horizontal axis and the leaves
representing the vertical axis. Observe the values to the left of the stems.
These are running totals of the number of leaves, counted in from the nearer
end of the data. For example, the 14 beside the row 3|55679
is the number of leaves on the stems with values 2 and 3. The number 2 in
parentheses indicates that the median is in that row.
The stem-and-leaf display indicates that the data are right-skewed, since values
above the median are more spread out than values below the median.
Yet another useful graphing method for quantitative data is the box plot, which
we will discuss in the next section.
Describing Distributions with Numbers
In addition to graphs, we use special numbers to describe quantitative data.
These numbers are given the technical name numerical measures. We may
classify them into two groups as follows:
1. Measures of Center: Mean, Median
2. Measures of Spread: Range, Interquartile Range and Standard
Deviation
To understand these measures, we look at the following terminology.
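Mean: the arithmetic average of the $n$ values,
\[ \bar{x} = \frac{1}{n}\sum x_i . \]
Median $M$: the middle value of the ordered data. If $n$ is odd, $M$ is the
value in the $\frac{n+1}{2}$th position; if $n$ is even, $M$ is the mean of the
values in the $\frac{n}{2}$th and $\left(\frac{n}{2}+1\right)$th positions.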
8.2 5.8 4.2 8.0 7.4 5.3
5.8 6.9 9.3 8.6 11.7 5.6
a) Make a stem plot of the data and describe the shape of the distribution.
b) Calculate the mean and median.
c) Which measure of centre in part b) better represents the data? Explain.
Solution:
a) Stem-and-Leaf Display for WaitingTime: unit = 0.1
  1    3|8
  2    4|2
  6    5|3688
 (2)   6|59
  7    7|4
  6    8|026
  3    9|34
  1   10|
  1   11|7
The stem plot shows the distribution is right-skewed (because of the outlier
time of 11.7 minutes).
b) Statgraphics reported that $\bar{x} = 7.1$ and $M = 6.9$.
Details of the computation for M are as follows:
Arrange the waiting times in numerical order:
3.8 4.2 5.3 5.6 5.8 5.8 6.5 6.9 7.4 8.0 8.2 8.6 9.3 9.4 11.7
Here $n = 15$, which is odd, and $\frac{n+1}{2} = 8$. Thus the median is M = 6.9, the value in
the 8th position.
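The mean and median reported for the waiting times can be checked with Python's standard `statistics` module. This is a minimal sketch; the list is the ordered set of 15 waiting times from the solution, and the variable names are illustrative.

```python
from statistics import mean, median

# Waiting times (minutes), in numerical order, from the solution to part b).
waiting_times = [3.8, 4.2, 5.3, 5.6, 5.8, 5.8, 6.5, 6.9,
                 7.4, 8.0, 8.2, 8.6, 9.3, 9.4, 11.7]

x_bar = mean(waiting_times)   # sum of the values divided by n
m = median(waiting_times)     # n = 15 is odd, so this is the 8th ordered value

print(round(x_bar, 1), m)
```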
We have seen that in the presence of outliers, the median is a better measure
of centre because it is less sensitive to extreme values than the mean is. The
mean tends to be pulled more in the direction of outliers relative to the
median. If the distribution is right skewed, a small proportion of the data
have relatively large values and will make the mean greater than without the
outliers. If a distribution is left skewed, the few but relatively small values
will make the mean smaller than without the outliers. If the distribution is
exactly symmetric, the mean and median are identical.
Exercise: Do exercise 1.45, page 29, and check with the answer at the back of
the text.
Measuring Spread.
A measure of the center alone to describe a distribution can be misleading
since two distributions with the same mean can be different in the way their
values are spread out around their mean. A complete numerical description
should include a measure of centre and a measure of spread. We now
examine measures of spread.
To measure the spread in a distribution, you can just determine the
difference between the largest value and the smallest value, called the range.
However, the range depends only on the two most extreme values, so if most of
the values are clustered near the centre, it overstates the spread.
We can improve on the range as a measure of spread by using more
intermediate values. It is helpful if we first order the values. If we put 100
values in numerical order, the value in say the 25th position is referred to as
the 25th percentile, the value in the 75th position as the 75th percentile, and so
on. The median is the 50th percentile. In general, the pth percentile of a
distribution is a value K such that p percent of the values are less than or
equal to K and (100-p) percent are greater than or equal to K.
Example 6: Calculate the 75th percentile for the waiting times in Example 5.
Explain what it means in context.
Solution. The stemplot arranged the values in numerical order. There are 15
values, so 75% of the way up the list is position 15 × 0.75 = 11.25. We interpret this rank to
mean that any value between the 11th and 12th positions will work. Since there
is no such value in the data, we use the mean of the two values instead. The values in
positions 11 and 12 are 8.2 and 8.6 respectively. Their mean is 8.4. So the
75th percentile is 8.4 minutes. This means 75% of the customers wait in line
for up to 8.4 minutes and 25% wait 8.4 minutes or longer.
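The rule used in this solution can be sketched as follows. The function name is hypothetical, and two details are assumptions not stated in the notes: positions are clamped to the ends of the list, and a whole-number rank is handled the same way as a fractional one. Statistical software often uses other interpolation rules and may give slightly different answers.

```python
import math

def percentile(values, p):
    """p-th percentile: mean of the two ordered values on either side of rank n*p/100."""
    ordered = sorted(values)
    n = len(ordered)
    rank = n * p / 100                # e.g. 15 * 0.75 = 11.25
    lower = max(math.floor(rank), 1)  # 1-based position at or below the rank
    upper = min(math.floor(rank) + 1, n)  # 1-based position above the rank
    return (ordered[lower - 1] + ordered[upper - 1]) / 2

# Waiting times (minutes) from Example 5.
waiting_times = [3.8, 4.2, 5.3, 5.6, 5.8, 5.8, 6.5, 6.9,
                 7.4, 8.0, 8.2, 8.6, 9.3, 9.4, 11.7]

p75 = percentile(waiting_times, 75)
print(round(p75, 1))
```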
Measuring Spread with Interquartile Range (IQR)
The most commonly used percentiles other than the median are quartiles.
The first quartile, denoted by Q1, is the 25th percentile and the third quartile,
denoted by Q3, is the 75th percentile. The quantity Q3-Q1, called the
interquartile range (IQR), shows the spread of the middle 50% of the data.
When the data is skewed, IQR is usually a better measure of spread than the
range of the entire data.
Observe that Q1 is the median of the bottom 50% of the data and Q3 is the
median of the top 50%. So to find Q1 and Q3:
1. Arrange the values in numerical order and find the overall median M.
2. Find the median of the values to the left of M in the ordered list. This is Q1.
3. Find the median of the values to the right of M in the ordered list. This is Q3.
Example 7: The ages of 10 randomly selected residents in a seniors' home
are the following: 80 55 90 73 75 80 85 92 93 98
Calculate the interquartile range and explain what it means.
Solution: In numerical order the ages are
55 73 75 80 80 85 90 92 93 98
Since there are 10 values, the median M is the mean of the values in the 5th and
6th positions. That is, $M = \frac{80 + 85}{2} = 82.5$. None of the values equals M, but of the
five values less than M, the median Q1 is 75, and of the five values greater
than M, the median Q3 is 92. Thus IQR = 92 - 75 = 17. This means the ages of
the middle 50% of the residents in the sample have a spread of 17 years.
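The quartile calculation in Example 7 can be sketched as below. The function name is hypothetical. It takes Q1 and Q3 as the medians of the values on either side of the median position, leaving out the middle value itself when n is odd; other quartile conventions exist and can give slightly different results.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-each-half method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # values to the left of the median position
    upper = ordered[half + n % 2:]   # values to the right (skips M when n is odd)
    return median(lower), median(ordered), median(upper)

# Ages from Example 7.
ages = [80, 55, 90, 73, 75, 80, 85, 92, 93, 98]
q1, m, q3 = quartiles(ages)
iqr = q3 - q1
print(q1, m, q3, iqr)
```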
The Five-Number Summary and Boxplot.
To get a description of the centre and spread of a distribution, we can report
a five-number summary using Minimum, Q1, Q2 (the median), Q3 and Maximum. A
graphical description of the data using the five-number summary is the box
plot.
Example 8: Obtain a boxplot for the seniors data.
Solution: The five-number summary is Min = 55, Q1 = 75, Q2 = 82.5,
Q3 = 92, and Max = 98. Here is the box plot.
[Figure: Box-and-Whisker Plot; horizontal axis labelled Marks, with ticks from 55 to 105]
The line inside the box represents the median; the crosshair represents the
mean. In this case, the mean and the median are almost identical, which might suggest
that the distribution is roughly symmetric. In fact, the distribution is skewed
to the left, since the bottom 50% of the data are more spread out than the top
50%.
Example 9: Obtain a box plot of the waiting times in Example 5 and
describe the shape of the distribution.
Solution. The five-number summary is (check this)
Min = 3.8, Q1 = 5.6, Q2 = 6.9, Q3 = 8.6, Max = 11.7
\[ s = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2} \]
To get s, add up the squares of the differences between each value and the mean,
divide the result by the normalizing factor n - 1, and take the square root. That is, s
measures an average spread of the values from their mean. Use a
calculator or software (Statgraphics, Excel, other) to determine s for large
data sets.
5|5
6|
6|
7|3
7|5
8|00
8|5
9|023
9|8
The stemplot shows that the distribution is skewed to the left because the
value 55 appears to be an outlier and is cut off from the rest of the data.
(b) Summary Statistics for FirstExam
Count                10
Average              82.1
Standard deviation   12.5472
Minimum              55.0
Maximum              98.0
Range                43.0
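The average and standard deviation in this summary can be reproduced from the ten values shown in the stemplot. The sketch below computes s both with the standard library and step by step, dividing by n - 1; the variable names are illustrative.

```python
from statistics import mean, stdev

# The ten values read off the stemplot above.
first_exam = [55, 73, 75, 80, 80, 85, 90, 92, 93, 98]

x_bar = mean(first_exam)
s = stdev(first_exam)   # sample standard deviation (divides by n - 1)

# The same calculation written out: squared deviations, divided by n - 1, square root.
squared_deviations = [(x - x_bar) ** 2 for x in first_exam]
s_manual = (sum(squared_deviations) / (len(first_exam) - 1)) ** 0.5

print(x_bar, round(s, 4), max(first_exam) - min(first_exam))
```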
(c) Since the distribution is left-skewed, the mean and standard deviation
do not describe the distribution well. In this case the
median-IQR pair is a better choice for measures of centre and spread.
Your Turn
Use the stat keys on your calculator to find the mean and standard deviation
of the fuel efficiency readings (in litres/100 km) for 5 cars:
7.3, 7.2, 7.6, 7.4, 7.5
Do the following for practice. #s 1.59, 1.67 & 1.71. (Use Excel to view the
data and do the calculations by hand)
\[ z = \frac{x - \bar{x}}{s} \]
For the two marks, the standardized scores are $z_c = 2$ (chemistry) and $z_p = 4$ (physics).
That is, put on an equal scale, the physics mark is four standard deviations
higher than its mean whereas the chemistry mark is two standard deviations
higher than its mean. Thus the student is stronger in the physics class
than in the chemistry class.
Another use of the standard deviation is in predicting outcomes from bell
shaped distributions, as is explained below.
Density Curves and Normal Distributions
A density curve is an idealized mathematical model that describes the
overall pattern of a quantitative distribution (see Figure 1.17, page 41). A
density curve has the following properties:
It is always on or above the horizontal axis
The total area underneath it is 1.
Thus we interpret the area underneath a density curve over a
specified range of values to represent the proportion of the
distribution that falls in that range. For example, the median of a density
curve is the equal-area division point. If the density curve is symmetric, the
median and mean are identical.
Do exercises 1.79 & 1.81 on pages 44-45 and check with the answers at the
back of the text.
In exercise 1.80, note that for the total area under the rectangle to be 1, the
height of the rectangle must also be 1. In part (b), "lie above" and "are
greater than" have the same meaning. In part (c), "lie below" and "are less
than" also have the same meaning. Complete the rest of exercise 1.80.
Normal Density Curves
Normal density curves are a family of curves having the characteristics of
being symmetric, single-peaked and bell-shaped. All normal density curves
have the same overall shape but different centre and spread. The exact
distribution of a particular normal density curve is determined
completely by specifying the mean $\mu$, which is located at the centre, and the
standard deviation $\sigma$. Being a density curve, the total area under a normal
curve is 1.
Normal density curves are important in statistics for many reasons, three of
which are the following. First, normal density curves are good descriptions
of many real-life data, such as the distribution of marks on a test taken by a
large number of people, or the actual weight of cereal in an 800 gram box.
Second, a normal density curve is a good description of many chance
outcomes, such as the distribution of the proportion of heads in tossing a coin
many times, or the heights of individuals belonging to a given age group.
Third, as we will see later, normal density curves play a pivotal role in
drawing conclusions about a population based on a sample.
The 68-95-99.7 rule (see page 46)
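All normal density curves obey the 68-95-99.7 rule. The rule states that if
a density curve is Normal with mean $\mu$ and standard deviation $\sigma$, then
68% of the observations fall within $\sigma$ of $\mu$, that is, in the interval $(\mu - \sigma, \mu + \sigma)$;
95% of the observations fall within $2\sigma$ of $\mu$, that is, in the interval $(\mu - 2\sigma, \mu + 2\sigma)$;
99.7% of the observations fall within $3\sigma$ of $\mu$, that is, in the interval $(\mu - 3\sigma, \mu + 3\sigma)$.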
Example 13. The actual content of an 800 gram box of cereal varies from
box to box and can be modeled by a Normal density curve with mean
$\mu = 805$ grams and standard deviation $\sigma = 4$ grams. The 68-95-99.7 rule says
that the middle 68% of boxes contain between 801 grams and 809 grams of
cereal; the middle 95% of the boxes contain between 797 grams and 813
grams of cereal; and the middle 99.7% of the boxes contain between 793
grams and 817 grams of cereal.
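The three intervals in Example 13 come from stepping one, two and three standard deviations on either side of the mean. A minimal sketch, with illustrative variable names:

```python
mu, sigma = 805, 4   # grams, from Example 13

# k standard deviations either side of the mean -> approximate percentage of boxes
rule = {1: 68.0, 2: 95.0, 3: 99.7}

intervals = {pct: (mu - k * sigma, mu + k * sigma) for k, pct in rule.items()}

for pct, (low, high) in intervals.items():
    print(f"about {pct}% of boxes: {low} g to {high} g")
```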
\[ \frac{1}{2}(68\% + 95\%) = 81.5\%. \]
813 grams is $2\sigma$ above the mean and 809 grams is $1\sigma$ above the
mean. Thus according to the 68-95-99.7 rule, the proportion of boxes
1
2
Example: Suppose the variable x has a N(10, 2) distribution. What is the z-score for the value x = 15?
Solution. Here $\mu = 10$ and $\sigma = 2$, so if x = 15, then $z = \frac{15 - 10}{2} = 2.5$.
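Standardizing a value is a one-line calculation. The helper below is a hypothetical name used only for illustration; it is applied to the N(10, 2) example.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

z = z_score(15, 10, 2)
print(z)
```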
Eleanor's standardized score is $z = \frac{680 - 500}{100} = 1.8$.
Gerald's standardized score on the ACT is $z = \frac{x - \mu}{\sigma} = \frac{27 - 18}{6} = 1.5$.
Eleanor had the higher score because her standardized score is
higher and both tests measure the same kind of ability.
\[ z = \frac{600 - 572}{51} = 0.55 \]
From the standard normal table on page T-2, the area under the curve to the
left of z = 0.55 on the z curve is 0.7088; so the area above z = 0.55 is 1 - 0.7088 = 0.2912. That is, about 71% of the students scored lower than or
equal to 600 on the ISTEP and about 29% scored higher than 600.
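The table lookup can be checked numerically. The sketch below computes the area under the standard Normal curve from the error function in Python's `math` module; the function name is illustrative, and z is rounded to two decimals to mimic the table.

```python
import math

def standard_normal_cdf(z):
    """Area under the standard Normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

z = round((600 - 572) / 51, 2)   # rounded to two decimals, as for a table lookup
below = standard_normal_cdf(z)   # proportion scoring 600 or lower
above = 1.0 - below              # proportion scoring higher than 600

print(z, round(below, 4), round(above, 4))
```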
The standardized value of an ISTEP score x is $z = \frac{x - 572}{51}$.
From Table A, the z-score for the first quartile is approximately -0.67.
Thus we solve for x in $-0.67 = \frac{x - 572}{51}$ and get x = 537.83.
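Working backwards from a proportion to a score can also be done in software. The sketch below uses `NormalDist` from Python's `statistics` module with the mean 572 and standard deviation 51 used in the text. Because software does not round z to -0.67, its answer differs slightly from the table-based 537.83.

```python
from statistics import NormalDist

istep = NormalDist(mu=572, sigma=51)   # mean and standard deviation used in the text

z_q1 = NormalDist().inv_cdf(0.25)      # z-score with 25% of the area to its left
x_q1 = istep.inv_cdf(0.25)             # first quartile on the ISTEP scale

# The same step done by hand with the table value z = -0.67.
x_q1_table = 572 + (-0.67) * 51

print(round(z_q1, 2), round(x_q1, 2), round(x_q1_table, 2))
```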
Your Turn:
1. Find the minimum score for the top 10% of the ISTEP scores.
2. Practice Exercises: Exercise Set 1.3, page 60:
a) # 1.111