
MTH 220: Introduction to Probability and Statistics

CASE STUDY MANUAL

Spring 2015
Stephen F. Austin State University
MTH 220: Introduction to Probability and Statistics
Statistics Faculty

Beginning in the Fall semester of 2013, the statistics faculty at Stephen F. Austin State
University began a comprehensive review of the experience provided in the introductory
statistics course: MTH 220. This Case Study Manual represents the gathering together of several
key data sets that form the backbone of the current course. Subject material is presented in
context. The case studies included in this document were collected by the faculty listed below
from a variety of sources. All of the data presented in this manual is real – not realistic. That is,
every data set presented here represents actual information collected by experimenters and
researchers with the aim of problem solving using statistical inference.

Many statistics faculty contributed to the creation of this document by writing, editing, and/or contributing case studies. All of the faculty listed here actively participated in a review of
all course materials and experiences in committee format for two years before the completion of
this document.

Greg Miller, PhD Professor of Statistics


Kent Riggs, PhD Associate Professor of Statistics
Bob Henderson, PhD Assistant Professor of Statistics
Kalanka Jayalath, PhD Assistant Professor of Statistics
Shelley Cook, M.S. Lecturer, Mathematics & Statistics
Robin Sullivan, M.S. Lecturer, Mathematics & Statistics
Stephanie Weatherford, M.S. Adjunct Lecturer, Mathematics & Statistics
TO THE STUDENT
The primary source of information for your introductory statistics class is the notes, slides,
handouts, materials and words presented during class hours by your instructor. There is no
substitute for the experience afforded to you by attending class. This document and/or
supplemental books and online resources are not a replacement for class materials provided
during lecture meetings. You must go to class. Without the experience of being in class you
cannot participate in discussions and small-group work, see your instructor's mannerisms, or hear his or her voice inflection. All of these things tend to be more important than introductory statistics students realize.

This Case Study Manual is the most important document for MTH 220. It orchestrates the flow
of your course and provides all of the essential ingredients in a framed format. Again, the
details, examples and illumination within this frame come through classroom experiences. The
Case Study Manual is more central to the course than the supplementary text or online
homework system. Without a doubt, this entire manual should be read and repeatedly studied
throughout the semester. In many cases, your instructor will assign you to read pages from this
manual. However, you should consider this assignment already made at the time you read this
preface.

Beginning in 2013, the statistics faculty began thinking (re-thinking, actually) deeply about the
relevance of a first course in statistics. We decided that it was best to always have you – the
student – in problem solving mode. Hence, the “case study” approach was adopted. At all
points during the semester you will be playing the role of detective while trying to solve the case
or come to some resolution of the key question. At no point are you on the outside of a problem
looking in. Instead, you will always be immersed in some part of problem solving.

The cases presented in this manual originate from the fields of psychology, medicine, education,
business, law, health and sociology. This list is not meant to be exhaustive of the fields making
use of statistical science. Indeed, statistics is a service discipline. It provides a partnership with
whatever major you are currently pursuing at Stephen F. Austin. Virtually all professions in
modern society make use of data. More than ever before, data envelops you. It is more readily obtained, stored and processed than at any point in history. That fact calls for greater proficiency in data analysis and visualization among the educated populace. That's why you are here: to
gain an appreciation for the power of data analysis, probability, statistics and the decision
making process. We think the best way to have you acquire this appreciation is to allow you to
be a part of discovering a solution to a problem at all points throughout the semester.

We hope you enjoy your introductory statistics course. We also hope that this manual – and your course in general – provides you with definitive evidence for the power and applicability of
statistics. We hope you gain an appreciation for how sound data-driven decision making can
affect your life and career path in a positive way.

Greg Miller
Lead Author, MTH 220 Case Study Manual
January 2015
TABLE OF CONTENTS
CASE STUDY #1: Postponement of Death Theory 1
Statistical Inference for a Single Proportion

Case Question 1A 1
Population and Sample 1
Random Variable, Parameter and Statistic of Interest 2
Hypothesis to be Tested 4
Test Statistic and Sampling Distribution 4
The Binomial Distribution and Elementary Probability Laws 6
Case Question 1A Concluding Decision 10
Concepts and Terms to Review from Case Question 1A 12

Case Question 1B 12
Population and Sample 13
Random Variable, Parameter and Statistic of Interest 13
Hypothesis to Be Tested 13
Sampling Distribution 13
The Normal Approximation to the Binomial Distribution 16
Characteristics of Normal Random Variables 17
Calculations With Normal Random Variables 19
Approximation to the Sampling Distribution and P-value Calculation 22
Case Question 1B Concluding Decision 24
Concepts and Terms to Review from Case Question 1B 24

Statistical Follow-Up to Case 1B: An Introduction to Confidence Intervals 24


Concepts and Terms to Review from the Follow-Up to Case Question 1B 28

CASE STUDY #2A: Rapid Aging In Children 29
Statistical Inference for a Population Mean: One Sample t-test

Introduction 29
Case Question 2A 29
Population and Sample 30
Random Variable, Parameter and Statistic of Interest 31
Descriptive Statistics For Measurements on a Single Quantitative Variable 31
Hypothesis to Be Tested 40
Test Statistic and Sampling Distribution 41
Characteristics of t Random Variables 44
An Additional Consideration in Hypothesis Tests: Type I and II Errors 45
Case Question 2A Concluding Decision 47
Confidence Interval For a Population Mean Based on the One-Sample t-test 48
Concepts and Terms to Review from Case Question 2A 51
CASE STUDY #2B: GPA of Students in Introductory Statistics 52
Large Sample Statistical Inference for a Population Mean: The Central Limit Theorem

Introduction 52
Case Question 2B 52
Random Variable, Parameter and Statistic of Interest 53
Hypothesis to Be Tested 54
The Benefit of a Large Sample 56
The Central Limit Theorem 57
Test Statistic and Sampling Distribution 62
Case Question 2B Concluding Decision 62
Confidence Interval For a Population Mean Based on the Z Null Distribution 64
Concepts and Terms to Review from Case Question 2B 66

CASE STUDY #3A: Sleep Deprivation in Young Adults 67
Statistical Inference for Two Population Means: The Two Independent Sample t-Test

Introduction 67
Case Question 3A 67
Population and Sample 68
Random Variables, Parameters and Statistics of Interest 69
Hypothesis to Be Tested 69
Test Statistic and Obstacles to Finding an Exact Null Distribution 70
The Need for Pooling 74
Null Distribution for the 2-Sample test on Means Assuming Common Standard
Deviations 76
Case Question 3A Concluding Decision 77
Confidence Interval for the Difference in Two Population Means Based on the
Two Sample t-test 80

Statistical Follow-Up to Case 3A: Large Sample Test For the Difference in
Population Means 82

Statistical Follow-Up to Case 3A: The Problem of Unequal Population
Standard Deviations 83
Concepts and Terms to Review from Case Question 3A 84

CASE STUDY #4A: Revenue From Sugar Cane 85
Correlation and Regression

Introduction 85
Case Question 4A 85
Population and Sample 85
Random Variables of Interest 86
Scatterplots: A Graphical Descriptive Statistics Tool Used in Correlation &
Regression Problems 87
Parameters of Interest 89
A Statistic Useful in the Study of Correlation 90
The Line of Best Fit: What Do We Mean By “Best”? 93
The Line of Best Fit: Method of Least Squares 94
A Confidence Interval for the Slope 97
Case 4A Concluding Decision 101

Statistical Follow-Up to Case 4A: Hypothesis Test for the Slope 102
Concepts and Terms to Review from Case Question 4A 104

CASE STUDY #5A: Should You Admit Your Guilt? 106
Statistical Inference for Two Proportions

Introduction 106
Case Question 5A 106
Populations and Samples 106
Random Variables, Parameters and Statistics of Interest 107
Hypothesis to Be Tested 110
Development of a Test Statistic and the Central Limit Theorem Revisited 110
Case Question 5A Concluding Decision 113
Concepts and Terms to Review from Case Question 5A 114
Case Study #1: Postponement of Death Theory
Introduction

If an event of particular significance is upcoming in one's life, is it within an elderly person's ability to actually postpone death until the event has passed? For instance, each birthday can hold greater and greater significance for the aged. We often have special celebrations for those who are fortunate enough to live to 80, 85, 90, even 100 years old. Is it possible that the anticipation of
such celebrations can be incorporated into a “will to live” for those that are very old? This
“postponement of death theory” is the context for our first case study. Several data sets will be
explored in this first case, but consider the following scenario for our first inquiry:

Case Question 1A

During the last week of May 2014, there were 18 published obituaries in the
Lufkin/Nacogdoches newspapers for persons who were over 70 years old when they died. If
there is any true significance attached to the postponement of death theory, then one would
expect an increased rate of death shortly after a significant date – such as a birthday. Of the 18
obituaries, 6 people died within three months after having a birthday. Is the fraction observed in
this sample suggestive of the postponement theory?

Population and Sample

During the first few class meetings of the semester we discussed that the words population and
sample are key in statistical studies. For Case Question 1A,

• The population of interest is taken to be all people in the Lufkin/Nacogdoches area that
were over 70 years old when they died.

• The sample consists of 18 published obituaries in the two main local newspapers for one
week in late May. All 18 people in the sample were over 70 years old when they died.

Recall, to make sound statistical decisions based on collected data, our sample needs to be as
representative of the population of interest as possible. Discuss and think on the following
questions:

• Is it believable that the collected sample is a random sample?

• Do you think the sample collected is representative of the population of interest listed above?

• What difficulties might we face in collecting a random sample? Could the population be sampled in a different way that might be better than what was done in Case Question 1A?

You may very well have doubts about the representative nature of the collected sample. Despite
this, we often are faced with having to work with samples such as this one under the umbrella of disclaimers. We issue such a disclaimer – or assumption – now: the following analysis will
assume the 18 published obituaries are representative of the targeted population. If this
assumption could be proven to be grossly untrue, then any statistical significance that we may
attach to the results could be in question.

It is important to question the representative nature of samples. It is also important before data
collection to consider the best possible way that we could reasonably sample our population.
Despite such considerations, statisticians often have to work with less than perfect samples. This
is just a realistic feature of data analysis. So, caution is advised when working with samples that
could be argued to be unrepresentative. In the current case, we may have some doubts, but they
may not be so grave as to nullify all that follows.

Random Variable, Parameter and Statistic of Interest

Go back and carefully read how the data is presented in Case Question 1A. What we know
about the 18 people that died is whether or not they died within three months after having a
birthday. Six people passed away during the three months which followed a birthday. The other
12 did not. The “information” that each person contributes to the sample is essentially either a
“yes” or a “no”:

• Yes (Success): The person did die during the three months which followed a birthday.
• No (Failure): The person did NOT die during the three months which followed a
birthday.

Data such as this is said to have come from a Bernoulli Trial. Bernoulli trials are investigations
in which the resulting data has only two possible outcomes. We call the two possible outcomes
“success” and “failure” even though there is no positive or negative connotation attached to those
two words. Bernoulli trials result in “1’s” and “0’s”. The ones are attached to the outcome
“success” and the zeroes are attached to the “failures”. Label the “success” as the feature you are
interested in studying. In this case, a success will be associated with death during the three
months following a birthday. Data recorded as yes/no, up/down, left/right, on/off, in/out, etc. are
examples of Bernoulli trials.

We have 18 Bernoulli Trials in Case Question 1A. Imagine that each person in the sample was
assigned either a "1" or a "0" based on whether or not they were a successful Bernoulli trial. This type of assignment is an example of a random variable.

• A random variable is an assignment – one that obeys the laws of what in mathematics
we call a “function” – in which each experimental result is assigned a meaningful
number.

For instance, one person from the list of obituaries died on May 29, 2014 and they had just had a
birthday on May 13. The person is assigned a “1” – a success. Random variables are often
given notation such as X, Y or Z. Sometimes, if the same random variable is observed multiple times, we use subscripts in our notation, such as X_1, X_2, X_3, etc. For us, let's denote
our random variable of interest as X, where X is either a “1” or a “0” based on whether or not the
person from the list of obituaries was assigned a success or a failure when considering them as a
Bernoulli Trial. We have 18 obituaries, so our random variables will be denoted X_1, X_2, …, X_18. Specifically,

X_i = 1 if the i-th person is a "success", and X_i = 0 if the i-th person is a "failure",

for i = 1, 2, …, 18. Notice that our random variable has a finite number of possible outcomes (only
two). Random variables that have a finite or countably infinite number of possible outcomes are
called discrete. Some random variables have an uncountably infinite number of possible
outcomes. These types of random variables are called continuous. We will contrast discrete and
continuous random variables more later, but for now, the important thing is that you know the
definition of each type.

Recall from the first week of class that parameters are numerical characteristics of populations
while statistics are numerical characteristics of samples. In our population of all people in the
Lufkin/Nacogdoches area that were over 70 years old when they died, the parameter of interest is
the proportion of people that died during the three months which followed a birthday. This
proportion is of extreme importance. All the statistical analysis contained in Case Study #1 is
associated with a population proportion. Do not forget this. We will denote the population
proportion by p:

Let p = the proportion of all people in the Lufkin/Nacogdoches area that were over 70 years old
and died during the three months which followed a birthday.

Notice how specifically the parameter is defined. This is important. Loosely describing
parameters and statistics can lead to confusion and improper interpretation. You should strive in
your own problem solving to always list the parameter of interest and to very specifically
describe it in writing.

A statistic is a feature of a sample. In our sample, we know the proportion of people that died
during the three months which followed a birthday. This value was given to us in the statement
of the Case Question.

Let p̂ = the proportion of people in our collected sample that were over 70 years old and died
during the three months which followed a birthday.

Do not read beyond this point until you see the difference between p and p̂ . The value of p is
unknown. Despite this, we will try to reach a conclusion about p. We will use the value of p̂
(which is known) to infer something about p. The statistic p̂ is called the sample proportion.
In other words, we will use the statistic ( p̂ ) to estimate the parameter (p) and ultimately, we
will use other features of the statistic to formulate a final conclusion about p. This process of
using a statistic and its features to draw a conclusion about a corresponding parameter is called
statistical inference. In Case Study #1, we want to make a statistical inference about a
population proportion, p.

Hypotheses To Be Tested

So, just what conclusion are we trying to reach about p in Case Question 1A? Well, we are
trying to see if the postponement of death theory is substantiated by our data! That’s our main
inquiry. We should believe this theory is substantiated ONLY if our data indicates it. We
should place the burden of proof on the data and begin our investigation with the hypothesis that
the postponement of death theory is not suggested by the data. If an investigator is trying to
establish the relevance of the theory, he or she should NOT start off assuming the theory is true.
That would be terrible science. Instead, we should test the theory by taking data and then letting
the data tell us if the weight of evidence is so large that the theory is believable.

Now, if the postponement theory isn’t suggested by the data, then it would make sense that the
deaths occur randomly throughout the year. Since we are looking at the three months after a
birthday and three months is one quarter (or 25%) of a year, then it makes sense that if the
postponement theory isn't correct for our population, that p = .25. That is, if the postponement
theory isn’t correct for our population, the fraction of deaths in the three months following a
birthday is 25%. This is called the null hypothesis in our statistical investigation. We denote it
like this:

H_0: p = .25 (the population proportion is 25%)

On the other hand, if the postponement theory is substantiated, then this would mean that it
would be more likely for a death to occur right after a birthday. Specifically, if the
postponement theory is relevant for our population, then we would expect more than 25% of
deaths to occur in the three months following a birthday. Proponents of the postponement theory
want us to believe this. The burden of proof is on them to show that the data suggests that their
hypothesis is indeed correct. We will call the hypothesis that is trying to be substantiated the
alternative hypothesis. We denote it like this:

H_A: p > .25 (the population proportion is greater than 25%)

Test Statistic and Sampling Distribution

We’ve previously noted that we want to use the statistic p̂ in order to make a statistical
inference about p. We want to use p̂ in order to test H_0 vs. H_A. Now, from the information given in Case Question 1A, we can easily calculate that p̂ = 6/18. Of course, to the nearest percentage point, this means that p̂ ≈ 33%. The question is this: Is the fact that p̂ ≈ 33% sufficient evidence to reject H_0 and claim that the postponement theory is substantiated? Should we reject H_0 and decide to believe H_A on the basis of the statement that p̂ ≈ 33%?

Well, it is true that p̂ > .25. But, 33% may not seem much larger than 25%. Additionally, p̂ was calculated on the basis of only 18 deaths. So, we have SOME evidence for H_A on the basis that p̂ > .25. But, is this evidence convincing enough to reject H_0 and claim that the
postponement theory is substantiated? If so, we would call the evidence presented by the sample
statistically significant. We should immediately state that just because evidence is statistically
significant does not mean that the results are biologically significant or environmentally
significant or psychologically significant, etc. We will discuss this point more later. Suffice it to
say that the statistical significance or non-significance of a result should always be cross-examined against the significance from other perspectives. But, this is a course in statistics – so,
we are learning what it means for results to be statistically significant.

In order to know whether or not p̂ = 6/18 is a statistically significant result that is indicative of H_A, we must ask ourselves two very important questions:

• What other values of p̂ could we have obtained in other potential samples?

• How likely are these values of p̂ if the null hypothesis is true?

In particular, we need to know just how likely it is to observe p̂ = 6/18 if the postponement theory is not true for our population. If we could calculate that p̂ = 6/18 is a very likely occurrence under the assumption that the postponement theory is not true, then our sample has indicated evidence to retain H_0. That is, we wouldn't reject it based on the observed data, and the results gathered from the sample are not statistically significant in regard to the postponement theory.

However, what would it mean if we were to calculate that p̂ = 6/18 is very unlikely if H_0 is presumed correct? It would mean that p̂ = 6/18 would be a very rare event to observe if, in fact, H_0 is legitimate. This would make us wonder: why, if H_0 is correct, did we see a rare occurrence? This rare occurrence would cast doubt on the truth of H_0 and would ultimately lead to H_0's rejection in favor of H_A.

The above paragraphs are motivation for what statisticians call a p-value. The concept of p-
value is one of the most important ideas in all of statistical science. We will repeatedly use this
concept and calculate p-values. Your understanding of them will round into form as we progress
through more and more case studies. For now, reflect on the logic of the above paragraphs and
the following definition.

p-value: The chance of observing the value of the statistic from your sample (or one more
extreme) if, in fact, the null hypothesis is true.

Again, this definition and associated calculation will be reinforced all throughout the course.
Learn the definition now as your instructor will surely utilize the concept of p-value repeatedly.

If we answer the questions in the two bullets above, then we will have all the information
necessary to calculate a p-value. Once we have calculated a p-value, we will only be one step
away from making a conclusion about Case Question 1A. Also, if we answer the two questions
posed in the bullets we will also be able to construct what is called a sampling distribution.
Sampling distributions can be used for hypothesis tests as well as other statistical procedures that
we will see in subsequent cases. In fact, one really can’t do much of any statistical inference
about population parameters without sampling distributions.

sampling distribution – a description or list of the possible outcomes of a statistic along with
the likelihoods of these outcomes.

Sampling distributions describe the behavior of a statistic across repeated samples. In general,
we will only collect one sample. But, knowledge of the sampling distribution will allow us to
compare our sample with the other possible samples that we could have seen! This is powerful.
This is what will allow us to know whether we’ve seen a statistically significant result – by
comparing our statistic to other theoretical statistics that we might have seen. In this way, we
will know whether what we observed is rare or commonplace.

Go back and look at the boxed definition of sampling distribution above. For our current
problem, we could already complete one part of the sampling distribution for p̂ . We could
certainly at this point make a description or list of the possible values of p̂. They are:

0/18, 1/18, 2/18, 3/18, 4/18, 5/18, 6/18, 7/18, 8/18, 9/18, 10/18, 11/18, 12/18, 13/18, 14/18, 15/18, 16/18, 17/18, 18/18.

The sampling distribution for p̂ will be complete if we can calculate the “likelihoods of these
outcomes”. Admittedly, calculating these likelihoods is the more challenging part of
constructing any sampling distribution.

The Binomial Distribution and Elementary Probability Ideas

Recall our random variable

X_i = 1 if the i-th person is a "success", and X_i = 0 if the i-th person is a "failure".

Our statistic p̂ can be written p̂ = (Σ_{i=1}^{n} X_i) / n, where we know that the total number of Bernoulli trials is n = 18. The information provided in Case Question 1A provided us with the observed value of Σ_{i=1}^{18} X_i, which is 6. That is, we know the total number of successes is six in our sample of n = 18 total trials.

A set of Bernoulli trials is called independent if the outcome of any one trial (success or failure)
doesn’t alter the likelihood of success or failure on any other trial. It seems appropriate to label
our 18 Bernoulli trials as independent when considering that the date of one person's death
doesn’t generally alter the date of someone else’s death. Now, there are unfortunate situations
where there are simultaneous deaths of several people due to an accident of some sort, but there
was no indication of this among the 18 obituaries in the sample.

A set of Bernoulli Trials is said to have come from a binomial experiment if


• There are n independent Bernoulli trials all of which have the same likelihood of
resulting in success
• The experimenter has interest in the total number of successes among the n trials.

The data collected from the obituaries can reasonably be assumed to have come from a Binomial
experiment:

 We have n  18 independent Bernoulli trials. The time of each death that determines
whether the trial is classified as a success or failure can reasonably be assumed to be
independent from person to person. If the null hypothesis is true, then the chance of each
trial resulting in a success can be claimed to be p  .25 . This likelihood is reasonable to
apply to each trial.
We are interested in the total number of successes since our statistic, pˆ  i 1 X i n ,
n

utilizes this total.

Let Y = Σ_{i=1}^{n} X_i. Then, from the above argument we state that Y is a binomial random variable with parameters n = 18 and p = .25. A binomial random variable counts the total number of successes in a binomial experiment. The probabilities associated with binomial random variables can be obtained from the formula for the binomial mass function. All discrete random variables have mass functions, and it is the job of a mass function to provide the probabilities associated with all possible outcomes of the random variable. The formula for the binomial mass function is

[n! / (y!(n - y)!)] p^y (1 - p)^(n - y).

This formula gives the chance of exactly y successes among the n trials. So, if Y is a binomial
random variable with parameters n and p, then we denote the “probability of exactly y successes”
by the notation P(Y = y). Here, the capital P denotes "probability", the capital Y is the random
variable of interest and the lowercase y is the particular numerical outcome of the random
variable that is relevant in the calculation. The exclamation point (!) indicates the operation of
“factorial”, which is simply the successive product of all integers from the value preceding the
symbol down to 1. For instance, 6! = (6)(5)(4)(3)(2)(1) = 720.

Applying the binomial mass function to our case study, suppose Y is a binomial random variable
with parameters n = 18 and p = .25. Then, the chance of exactly 6 successes among the 18 trials is

P(Y = 6) = [18! / (6! 12!)] (.25)^6 (.75)^12 ≈ .1436.

Since the binomial mass function is quite prevalent in probability and statistics, it is a good idea
to get used to making calculations by hand using the formula. However, after a bit of practice,
the popular software Excel (as well as other computer programs) can make the calculations more
expeditious. The binomial mass function in Excel can be invoked by clicking in any particular
cell in the spreadsheet and then typing the following command:

=BINOM.DIST(y, n, p, FALSE)

For instance, clicking in a cell in Excel and typing =BINOM.DIST(6, 18, .25, FALSE) and then hitting the "Enter" key will produce the value .143564 in the cell. Replacing the word "FALSE" with the word "TRUE" will calculate the cumulative probability P(Y ≤ y), rather than the individual probability P(Y = y). Try it for our case study and you'll see that P(Y ≤ 6) = .861015.

Looking back at the binomial mass function we can see that it is made up of three parts:

• n! / (y!(n - y)!) counts the number of ways in which we could observe y successes among n trials. Often, we use the symbol with n stacked above y inside parentheses to denote n! / (y!(n - y)!), and this symbol is read "n choose y". If you have n distinguishable objects, and you want to choose y of them to put in a set, then the number of ways to do this is "n choose y".
• The second piece of the binomial mass function is p^y, and it incorporates the fact that we need exactly y successes, each of which has probability p.
• The final piece of the binomial mass function is (1 - p)^(n - y), and it incorporates the fact that if we need exactly y successes, then this means that there are n - y failures. Also, since the chance of a success is p, then the chance of a failure is 1 - p.

This last point introduces the important complement rule of probability. For our relevant
binomial random variable, notice that the list of possible outcomes for Y is 0, 1, 2, …, 18. We will denote this complete list of possibilities as S and write S = {0, 1, 2, …, 18}. The calculation P(Y ≤ 6) = .861015 only involves a subset of S; namely, E = {0, 1, 2, 3, 4, 5, 6}. The set E is an example of what is called an event. The complement rule of probability states that for any event E, we have that P(not E) = 1 - P(E).

In our example, "not E" consists of the following values:

not E = {7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18}.

The chance of this event, by the complement rule, is 1 - .861015 = .138985. Clearly, another way to write "not E" is Y ≥ 7. So, the complement rule tells us that P(Y ≥ 7) ≈ .1390. Notice, for future use, that it would seem to make sense that P(Y ≥ 6) = P(Y = 6) + P(Y ≥ 7). That is, to get P(Y ≥ 6), just add in the one additional probability to the calculation already made. Doing this would give us the result that P(Y ≥ 6) ≈ .1436 + .1390 = .2826. The legitimacy of this
calculation stems from the axioms of probability, of which there are three. The mathematical
subject of probability, like geometry, is based on a set of axioms which are regarded as universal
truths needing no proof. Many scientific theories or branches of science begin with the scientific
community accepting some set of axioms. If you accept the axioms, you will accept the
theorems and results which stem from them. If you don’t accept the axioms, then the entire
branch of science may be held in question. Deciding on axioms has historically been
challenging, controversial and time consuming. Scientists didn’t generally come to an agreement
on the “probability axioms” until 1933.

Looking back at our calculations, it makes sense that for any event E, we should claim that
P  E   0 and that for the event S we should claim that P  S   1 . These are, in fact, the first
two axioms. First, probability is never negative. Second, we are certain to observe one of the
possible outcomes in the set that exhausts all the potential possibilities. The third and final
probability axiom is the one that posed the biggest challenge to scientists in terms of its
acceptance.

A few paragraphs back, we looked at the event that just contained the single value 6, namely {6}. Notice that the value "6" does not appear in the event not E = {7, 8, 9, …, 18}. When two events (or sets)
don’t share any values (elements) in common, we call them mutually exclusive. When we have
a group of events and each and every pair that could be chosen from the group is mutually
exclusive, then we call the entire group of events pairwise mutually exclusive. The third axiom
of probability says that if two events, say A and B, are mutually exclusive, then P(A or B) = P(A) + P(B). This is the axiom that we used to compute P(Y ≥ 6) = .1436 + .1390 = .2826.

But actually, the third axiom of probability applies to more than two sets. In fact, it applies to
any number of sets that are pairwise mutually exclusive (that was the part that took so long to
agree upon historically). It is of utmost importance that you realize that the third axiom applies ONLY to events that are pairwise mutually exclusive.

Axiom 3: If E_1, E_2, E_3, … is a collection of pairwise mutually exclusive sets (all of which are subsets of a larger set S that contains all possible outcomes of the experiment in question), then

P(E_1 or E_2 or E_3 or …) = P(E_1) + P(E_2) + P(E_3) + …

Make sure you realize that the ellipsis, "…", means that we could have three events, ten events,
one thousand events, etc. It doesn’t matter. As long as all the events in question are pairwise
mutually exclusive, Axiom 3 applies. But, it is ONLY good for calculating probabilities that involve the notion of "one set or another or another, etc." If you need to calculate P(A and B), then we may need other probability rules that will emerge later.

Case Question 1A Concluding Decision

The question in Case Study 1A can be answered by using the development presented to this
point. The work above has outlined the statistical procedure known as a hypothesis test for a
proportion. We have two competing hypotheses and the data will let us know which is most
plausible. As a recap, our two hypotheses are

H_0: p = .25 (the population proportion is 25%)


H_A: p > .25 (the population proportion is greater than 25%)

The focal question is regarding the level of evidence for the postponement of death theory.
Since this theory is under scrutiny, we should not presume it is true. We should assume it is
false until the data suggests otherwise, if at all. This is why H_0 and H_A are set up in the manner
that they are. What the experimenter is hoping to establish is generally placed in the alternative
hypothesis.

The parameter being tested is p, the population proportion. It was estimated by the sample
proportion p̂ = 6/18. In order to know whether or not this result is suggestive of rejecting H_0, we needed to obtain a sampling distribution for our statistic. The sampling distribution describes the behavior of p̂ in repeated samples. Even though we are unlikely to observe these theoretical "other" samples, knowledge of the sampling distribution provides a context for our observed value of p̂ = 6/18. When we assume H_0 is true, the distribution of a statistic used in a hypothesis test is more specifically called the null distribution. The relevant null distribution for Case Study 1A is the binomial distribution with n = 18 and p = .25.

Notice that large values of p̂, and in turn large values of Y = Σ_{i=1}^{n} X_i, are indicative of rejecting H_0 and deciding H_A is the most relevant hypothesis based on the evidence in the data. Since large values indicate rejection, the p-value associated with our hypothesis test is the chance that we observe p̂ = 6/18 or a value larger. We calculate this chance using the null distribution. If you take a moment to reflect on the definition of p-value, you will recall the phrase "or more extreme" appears. Here, "more extreme" is interpreted as "larger than". In general, "more extreme" is interpreted in light of the alternative hypothesis (notice the ">" sign in H_A).

So, our p-value is P(p̂ ≥ 6/18). Of course, the only way that p̂ can be greater than or equal to 6/18 is if Y ≥ 6. We make this calculation using the null distribution, which is binomial with n = 18 and p = .25. We know from previous work that this probability is P(Y ≥ 6) = .2826. Now, we are at the point of decision. The p-value is a percentile. Specifically, the p-value we obtained is the upper 28th percentile of the null distribution. This means that if H_0 is true, we
can expect to see our sample proportion or one larger 28% of the time. Does this 28% seem
overly rare to you? Probably not. While we all inherently have different personal interpretation
of the word “rare”, statisticians that are involved in statistical inference problems generally don’t
label p-values as rare until they fall in the lower or upper 5%-10% of the null distribution. This
will be discussed more in other case studies.

Since our p-value of 28% isn't particularly rare, we don't have overwhelming evidence to reject the null hypothesis. You can think of a p-value as a barometer of sorts. The lower the p-value, the more evidence exists in the data to reject H_0 and instead conclude that H_A is most statistically reasonable. If the p-value isn't low, then you have observed data that is commonplace if in fact H_0 is true. This is what has happened to us. We didn't observe a low p-value, so we don't have sufficient evidence to reject H_0. So, since we don't reject H_0, we will retain it as the most reasonable conclusion. Therefore, we retain the null hypothesis that p = .25. That is, the claim that the population proportion is 25% can't be refuted. It appears we do NOT have sufficient data-driven evidence to believe the postponement of death theory in our population.
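As a quick software check of this conclusion (again only a sketch, assuming the scipy library is available), the p-value P(Y ≥ 6) can be computed in one line from the null distribution:

    from scipy.stats import binom

    # p-value for Case Question 1A: P(Y >= 6) = 1 - P(Y <= 5) under the null distribution.
    p_value = 1 - binom.cdf(5, 18, 0.25)
    print(round(p_value, 4))   # about 0.2825; the .2826 above reflects rounding of the two pieces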

For a visual look at our null distribution and p-value, look at the following chart made in Excel.

[Excel chart: the null distribution of Y, binomial with n = 18 and p = .25. The horizontal axis shows the possible values of Y from 0 to 18, the vertical axis shows probability, and a vertical line marks the observed value Y = 6.]

Can you tell where the “rare” values of Y are from this picture? If the null hypothesis is true, it
would be very rare to observe Y ≥ 11. In fact, Excel calculations can show that P(Y ≥ 11) is less than 1%. The probabilities for Y = 11, 12, …, 18 are so small that Excel plots them right on the horizontal axis. These probabilities aren't actually zero, but they are so close that the plotting symbol is right on the axis. The vertical line represents the value of Y that we observed in our sample. Look at it in the context of the entire null distribution. Does it appear in the "rare" regions – what are called the "tails" – of this distribution? No. So, the null hypothesis wasn't rejected. However, if 11, 12 or more of the 18 obituaries had been during the three months which followed a birthday, then our data would have been suggestive of the postponement of death theory. In these cases, the p-value would have been less than 1% – very rare. If we had observed that rare of an event, we would have been forced to wonder why we saw something so rare if in fact H_0 is true. Observing this rare event would have led to a change in what we could presume reasonable. It would have led to our abandoning H_0 and concluding the data is indicative of H_A instead. The lower the p-value, the more evidence we have for H_A.
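The claim that Y ≥ 11 would be a sub-1% event can be checked with the same software approach sketched earlier (scipy assumed available):

    from scipy.stats import binom

    # P(Y >= 11) = 1 - P(Y <= 10) under the null distribution with n = 18 and p = .25.
    print(1 - binom.cdf(10, 18, 0.25))   # roughly 0.001, comfortably below 1%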

Concepts and Terms To Review from Case Question 1A:

Population                      Statistically significant
Sample                          p-value
Bernoulli Trial                 sampling distribution
Random Variable                 independent trials
Discrete random variable        Bernoulli Trial
Continuous random variable      Binomial random variable
Parameter                       Binomial mass function
Statistic                       Cumulative probability
Population proportion           Complement rule
Sample proportion               Event
Statistical inference           Axioms of Probability
Estimate                        Mutually Exclusive
Test of hypothesis              Pairwise Mutually Exclusive
Null hypothesis                 Null distribution
Alternative hypothesis

Despite the small data set from Lufkin/Nacogdoches not substantiating the postponement of
death theory, the concept has garnered considerable attention throughout the world. The idea
that someone can postpone their death until the passing of a birthday, a family reunion or a reconciliation with a family member has been studied, and published reports on each scenario exist in the scientific literature. One of the most discussed papers on this subject is Phillips (1972), "Death and Birthday: An Unexpected Connection". In his essay, Phillips looked at obituaries published in a newspaper from Salt Lake City, Utah. His data structure was slightly different than that of Case Question 1A. Phillips looked at the percentage of people that died in the three months prior to their birthday. Among 747 deaths, only 60 of them fell within this three-month window. If deaths are occurring randomly during the year, we would expect 25% of people to pass away in the three months prior to their birthday. Notice that 60 of 747 is only 8%.

Case Question 1B

A total of 747 obituaries were collected from a Salt Lake City newspaper. These obituaries were
scattered throughout the period of one year. Among these 747 obituaries, only 60 people died in
the three months prior to their birthday. Is this decrease in what would have been expected if the
deaths were randomly occurring throughout the year statistically significant? That is, do the Salt
Lake City obituaries provide statistical evidence for the postponement of death theory?

Population and Sample


On the surface, Case Question 1B is quite similar to Case Question 1A. The data is collected in
a slightly different way, but the issue still boils down to whether or not the disproportionate (8%
vs. 25%) fraction of deaths in a particular time interval is substantial enough to lend credence to
the postponement theory. The population of interest here could reasonably be taken to be the
greater Salt Lake City area. Similar to the Lufkin/Nacogdoches data, all of the people featured in
the obituaries were within one region of the United States. Extrapolation to all of Utah, or all of
the United States would seem to be a stretch since Salt Lake City newspapers don’t tend to
contain birth and death announcements from around the country.

The sample is different from the Lufkin/Nacogdoches data in several ways. First, there is no
minimum age used to define the deaths. In the Lufkin/Nacogdoches data, all 18 of the deaths
pertain to people over 70 years old when they died. This isn’t the case in the Salt Lake City data.
Secondly, the data is taken over a longer time interval – all of the people in the Lufkin/Nacogdoches
data set died within one week of each other. Third, and possibly most central to the mathematics
in this section, the Salt Lake City sample contains 747 people, whereas the Lufkin/Nacogdoches
sample was small – only 18 people.

Like with Case Question 1A, we should ponder the assumption that the 747 deaths that make up
the sample constitute a random sample of deaths across a year from the greater Salt Lake City
area. Some of the same issues that arose in assuming a representative sample in the
Lufkin/Nacogdoches area may be relevant in the Salt Lake City data as well. Going forward, we
will assume the 747 obituaries represent a random sample of all deaths during one year in the
greater Salt Lake City area.

Random Variable, Parameter and Statistic of Interest

Like Case Question 1A, each person in the sample can be represented by a Bernoulli Trial. For
i = 1, 2, …, 747 define

X_i = 1 if the i-th person is a "success", and X_i = 0 if the i-th person is a "failure".

Here, a success represents a person dying within the three months prior to their birthday. A
failure corresponds to the person dying at some other time during the year. In corresponding
fashion, our relevant parameter and statistic are p and p̂ as defined below.

Let p = the proportion of all people in the greater Salt Lake City area that die within the three
months prior to their birthday.

Let p̂ = the proportion of people in our collected sample that died within the three months prior
to their birthday.

If the deaths occur in random fashion throughout the year, then p = 0.25. The data provided in Case Question 1B tells us that among the 747 Bernoulli Trials, 60 of them were successes. Namely, we have observed that Σ_{i=1}^{747} X_i = 60.

Hypotheses To Be Tested

At this point, we have defined a population of interest and identified the collected sample.
Additionally, we have a random variable of interest and similar to Case Question 1A, we will use
sample proportion (p̂) to estimate a population proportion (p). Our statistical inference procedure now begins with assuming that the postponement of death theory is not true and forces the data to present sufficient evidence to the contrary. Proponents of the postponement theory will point to the fact that p̂ ≈ 8% in their argument. Is this sufficient evidence? We should not presume it is – instead, we should test to see if this small value of p̂ is statistically significant. If the postponement of death theory is correct, then based on the way that the data are collected and the way that p is defined, we would expect p < .25. Therefore, our null and alternative hypotheses for Case Question 1B are

H_0: p = .25 (the population proportion is 25%)

H_A: p < .25 (the population proportion is less than 25%).

Make sure you realize that although we are testing the same “postponement theory”, the data has
been collected in a different way in Case Question 1B. This necessitates the alternative
hypothesis being lower-tailed (notice the "less than" symbol), whereas H_A was upper-tailed in
Case Question 1A.

Sampling Distribution

In order to know whether or not p̂ = 60/747 is a statistically significant result that is indicative of H_A, we must ask ourselves two very important questions:

• What other values of p̂ could we have obtained in other potential samples?

• How likely are these values of p̂ if the null hypothesis is true?

These are the same two questions posed in Case Question 1A and they are universal questions
that are asked in any statistical inference problem that involves a hypothesis test. Answering
these questions in Case 1A led to considering Σ_{i=1}^{18} X_i as a test statistic, and the corresponding null distribution was presented as Binomial. If we again assume that the times at which the people died in Salt Lake City are independent, then our current test statistic, Σ_{i=1}^{747} X_i, is also Binomial.

For Case Question 1A, the first bullet above was answered with the list

0/18, 1/18, 2/18, 3/18, 4/18, 5/18, 6/18, 7/18, 8/18, 9/18, 10/18, 11/18, 12/18, 13/18, 14/18, 15/18, 16/18, 17/18, 18/18.

The list of possible values for p̂ in Case 1B is much, much longer:

0/747, 1/747, 2/747, 3/747, …, 59/747, 60/747, 61/747, …, 746/747, 747/747.

This long list has the potential to cause problems. First of all, let’s establish that we once again
have a binomial experiment:

 We have n  747 independent Bernoulli trials. The time of each death that determines
whether the trial is classified as a success or failure can reasonably be assumed to be
independent from person to person. If the null hypothesis is true, then the chance of each
trial resulting in a success can be claimed to be p  .25 . This likelihood is reasonable to
apply to each trial.
We are interested in the total number of successes since our statistic, pˆ  i 1 X i n ,
n

utilizes this total.

Let Y = Σ_{i=1}^{n} X_i. Then, from the above argument we state that Y is a binomial random variable with parameters n = 747 and p = .25. Applying the binomial mass function to Case 1B, suppose Y is a binomial random variable with parameters n = 747 and p = .25. Then, the chance of exactly 60 successes among the 747 trials is

P(Y = 60) = [747! / (60! 687!)] (.25)^60 (.75)^687.

Dealing with large exponents and factorial calculations this large can pose problems with
accuracy – even on modern hand-held calculators. The value of 60! is larger than an 8 with 80
zeroes behind it. Dealing with values this large should be done with extreme caution. This
computational concern will give us a chance to examine the features of the binomial distribution
when the value of n is large.
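For what it is worth, statistical software sidesteps the factorial problem. A library routine such as scipy's binomial functions (assumed available in the Python sketch below) evaluates these probabilities without ever forming 60! or 747! explicitly:

    from scipy.stats import binom

    n, p_null = 747, 0.25
    # The huge factorials are never formed directly by the library routine.
    print(binom.pmf(60, n, p_null))   # P(Y = 60): an extremely small number
    print(binom.cdf(60, n, p_null))   # P(Y <= 60): also extremely small under H_0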

When calculating the p-value for the hypothesis test in Case 1A, we were able to create a graph
of the sampling distribution for Σ_{i=1}^{18} X_i. Recall, a sampling distribution is a description or list of the possible outcomes of a statistic along with the likelihoods of these outcomes. The graph of the sampling distribution included all of the possible values of Y, which were 0, 1, 2, 3, …, 18.

The graph of the sampling distribution for Σ_{i=1}^{747} X_i must include all of the integers from 0 to
747. Excel is able to generate this plot and it is printed below. Notice that the horizontal axis
only includes the values of Y from 125 to 250. This is because the probabilities for all other
values of Y are so small that they are effectively zero. So, to focus on the shape of the plot, these
values were excluded. Just think of the plot extending to the left and the right of what you
actually see, but hugging the horizontal axis very close to zero out towards 0 in the left tail
and out towards 747 in the right tail.

[Excel chart: the sampling distribution of Y, binomial with n = 747 and p = .25. The horizontal axis shows the possible values of Y from 125 to 250 and the vertical axis shows probability.]

The Normal Approximation to the Binomial Distribution

The shape of this sampling distribution is unmistakable. The sampling distribution looks like a
bell-shaped curve. Now, make sure you understand: the plot above consists of just 748 isolated
points. That is, the sampling distribution is graphically represented by the 748 points (or
diamonds that Excel uses). But, because there are so many points represented in the graph, it has
the appearance of being a smooth curve. This fact can help us resolve the question in Case 1B.


747
The exact sampling distribution of i 1
X i is binomial with a very large value of n (n  747) .
From the picture above, it appears as though this exact sampling distribution could be accurately
approximated by a smooth curve. What we will do is replace the exact sampling distribution
with this approximate smooth curve and then, our computational difficulties discussed earlier
will be completely resolved. Finally, we will be in a position to easily calculate the
(approximate) p-value for our hypothesis test and resolve the question from Case 1B.

Recall that random variables that have a finite or countably infinite number of possible outcomes
are called discrete. Random variables that have an uncountably infinite number of possible
outcomes are called continuous. The binomial is a discrete random variable, but when n is
large, it can be approximated by the continuous random variable known as the normal random
variable.

When calculating probabilities associated with the binomial random variable (or any other
discrete random variable, for that matter), we can just plug appropriate values into a mass
function. Recall that all discrete random variables have mass functions and it is the job of a mass
function to provide the probabilities associated with all possible outcomes of the random
variable. The formula for the binomial mass function is

[n! / (y!(n - y)!)] p^y (1 - p)^(n - y).

This formula gives the chance of exactly y successes among the n trials.

Probability associated with continuous random variables must come from what is known as a
density function. All continuous random variables have density functions. It is the job of a
density function to provide all the probabilities associated with outcomes of the continuous
random variable. However, density functions achieve this goal in a different manner than mass
functions do for discrete random variables. Probability associated with continuous random
variables is calculated as area under the density curve, rather than by “plugging in” to the density
function. The chart below compares and contrasts how to find P(a ≤ X ≤ b) when using
random variables.

Type of Random Variable | Relevant Function | How to Find P(a ≤ X ≤ b)
Discrete                | Mass function     | Plug in all values between (and including) a and b into the mass function and add up the results.
Continuous              | Density function  | Find the area under the graph of the density function over the interval [a, b].

Characteristics of Normal Random Variables

The normal random variable and its associated normal density function are the most popular
continuous random variable and density function in all of probability and statistics. The density
function for a normal random variable has the following features:

• All normal density functions are symmetric around a value known as μ. We call the value of μ the mean of the normal random variable.
• All normal density functions are unimodal, meaning that the density function has one peak. This one peak, or mode, is at μ.
• All normal density functions have their width, or "fatness", controlled by a value known as σ. We call the value of σ the standard deviation of the normal random variable.

The terms “mean” and “standard deviation” will be refined and repeated in future cases. For
now, what is important is that you understand that the mean of a random variable is a measure of
location or center. It is the long-run average outcome that one would see if many, many, many
outcomes of the random variable were observed. The standard deviation is a measure of the
spread inherent in the outcomes of the random variables. The larger the standard deviation, the
more spread out the outcomes of a random variable. If a random variable has a very small
standard deviation, then the outcomes of the random variable will tightly cluster rather than be
widely dispersed. One other important characteristic of normal random variables is that it is quite
rare to observe an outcome of a normal random variable that is more than three standard
deviations above or below the mean.

The following display illustrates the effect of μ and σ on the graph of the normal density function. In each case, the mean controls where the bell is located and the standard deviation affects the girth, or width, of the curve. Notice that when μ = 0, the curve with σ = 1.5 is "fatter" or wider than the curve with σ = 1. When μ = 2, the normal curve has located itself two units to the right of the other two curves, but because σ = 1/2, this is the "thinnest" or most "skinny" of the three curves plotted. Normal density curves exist for all values of μ and for all positive values of σ (-∞ < μ < ∞, σ > 0).

Previously, we have introduced the three axioms of probability. The second of these states that
we are certain to observe one of the possible outcomes in the set that exhausts all the potential
possibilities for an experiment. This axiom translates to continuous random variables in such a
way that the total area under all density curves is equal to 1 – representing 100% of the possible
outcomes of the random variable. So, even though the three normal curves in the display have
different locations and spreads, they all have the same total area underneath their curve.
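A rough version of the display described above can be drawn with the following Python sketch (matplotlib and scipy assumed available). The three curves have different centers and spreads but identical total area underneath them.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import norm

    x = np.linspace(-6, 6, 500)
    # The three (mu, sigma) pairs discussed in the text above.
    for mu, sigma in [(0, 1.0), (0, 1.5), (2, 0.5)]:
        plt.plot(x, norm.pdf(x, loc=mu, scale=sigma), label=f"mu = {mu}, sigma = {sigma}")
    plt.xlabel("x")
    plt.ylabel("density")
    plt.legend()
    plt.show()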

Despite there being a different normal curve for each value of μ and for all positive values of σ, the normal curve with μ = 0 and σ = 1 is really the only one important for calculations and
statistical inference. This is because any probability (area under the curve) calculation that is
required for an arbitrary normal curve can be converted to a problem with the same answer that
uses this standard normal curve. The standard normal random variable is the normal
random variable with μ = 0 and σ = 1. This "conversion" process is called standardizing a
normal random variable and is summarized below.

Standardization Theorem for a Single Normal Random Variable: If X is a normal random variable with mean μ and standard deviation σ, then Z = (X - μ) / σ is a standard normal random variable.

Calculations With Normal Random Variables

We know that for a continuous random variable X, finding P(a ≤ X ≤ b) requires that we find the area under the graph of the density function over the interval [a, b]. Finding areas under
density curves is generally a calculus problem. For normal random variables, these calculus
problems have been solved by others and placed in tables for our use. Alternatively, modern
computer software such as Excel can calculate these areas for us.

It is important to understand the general philosophy that “standardizing” a random variable
involves subtracting the mean and dividing the result by the standard deviation. To solidify this
philosophy, we will briefly describe how to make calculations with normal random variables
using the standardization theorem. Once this philosophy is understood, then the quickest way to
make the calculations is in Excel. Going straight to the Excel code without an understanding of
the standardization process is dangerous and should be avoided. Standardization is a concept
that will emerge again in Case Studies 2A, 2B and 3A.

As an example of how to perform calculations with normal random variables, consider that the
height of a mature pine tree in the East Texas region could be modeled with a normal random
variable having mean μ = 90 and standard deviation σ = 8. What is the meaning of the phrase
“could be modeled with”? What we are thinking about here is the entire population of pine trees
(say, of one particular species, like loblolly) in East Texas. We have no hope of measuring the
heights of all the mature pine trees in East Texas. But if we could, we might surmise that the
average height would be 90 feet and that the heights would “crowd” around the value 90 in such
a way that values near 90 would be quite popular to see. Then as we moved away from 90
feet, the values would become less and less frequent, yet in a symmetric way. By this we mean
that a tree 100 feet or taller could be imagined to be just as likely to observe as a tree that is 80
feet or shorter. Finally, since it is rare to see observations more than three standard deviations away
from the mean when dealing with normal random variables, we might imagine that mature trees
above 114 feet or below 66 feet are quite rare to observe around East Texas. All of these facts
could be combined into a model – the model being the normal curve with mean 90 and standard
deviation 8.

Next, suppose with this model in place, we were asked for the chance that a mature pine tree in
East Texas grows to exceed 105 feet. How could this calculation be made? The answer is: we
need to calculate an area under the density curve. The area we need to obtain is shaded and
shown below.

Let the continuous random variable X represent the height of a mature pine tree in the East Texas
region. Then, we can write X ~ N(90, 8) to represent the fact that our model for X is normal
with mean μ = 90 and standard deviation σ = 8. We need to calculate P(X > 105). This is
done by using the standardization theorem. Once the random variable X is standardized, we
denote the resulting random variable by Z. The capital letter Z will be reserved notation just for
standard normal random variables (Z ~ N(0, 1)). The constant “105” is also standardized and
then we can use tables or software to finalize the calculations. The process is illustrated below.

    P(X > 105) = P( (X − 90)/8 > (105 − 90)/8 )
               = P(Z > 1.88).

A standard normal curve with the area to the right of z = 1.88 is shaded below. By using the
standardization theorem, we are assured that the area shaded to the right of 105 when
X ~ N(90, 8) is the same as the area shaded to the right of 1.88 when Z ~ N(0, 1). When using
tables to calculate areas under the standard normal curve, it is important to know which area the
table provides. The most common type of standard normal table provides a cumulative
probability. That is, the table provides P(Z ≤ z) for a constant z. Since it is quite rare to
observe an outcome more than three standard deviations from the mean when working with
normal random variables, these tables are typically given for all values of z such that −3 ≤ z ≤ 3.

Using a cumulative probability table for Z, we find that P(Z ≤ 1.88) = .9699. Thus, by the
complement rule,

    P(Z > 1.88) = 1 − P(Z ≤ 1.88) = 1 − .9699 = .0301.

We conclude that the chance of a mature pine tree in the East Texas region exceeding 105 feet is
approximately 3%.

Using a cumulative probability table for Z, we can:

• Find P(Z ≤ z) by looking up the value of z in the margin; the desired probability is in
the body of the table.
• Find P(Z > z) by looking up the value of z in the margin, finding the probability in the
body of the table and then applying the complement rule.
• Find P(z₁ ≤ Z ≤ z₂) by looking up the values of z₁ and z₂ in the margin, finding the two
probabilities in the body of the table and then subtracting the smaller probability from the
larger probability.

You should practice each of the three calculations with a variety of values for z until you are
confident.
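The same three table look-ups can also be checked with software other than Excel. The short
sketch below uses Python with SciPy (not part of the course's Excel/JMP toolkit, and shown only
as an optional illustration); the function norm.cdf plays the role of the cumulative Z table.

# Optional illustration in Python/SciPy of the three cumulative-table calculations.
from scipy.stats import norm

z, z1, z2 = 1.88, -1.96, 1.96

print(norm.cdf(z))                  # P(Z <= z), approximately .9699
print(1 - norm.cdf(z))              # P(Z > z) by the complement rule, approximately .0301
print(norm.cdf(z2) - norm.cdf(z1))  # P(z1 <= Z <= z2), approximately .95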

Excel can make all necessary normal distribution calculations. Placing the cursor in any cell in
an Excel spreadsheet and typing the code

=NORM.DIST(value, mean, std deviation, TRUE)

will return the probability that X ≤ value when X is a normal random variable having the values
of the mean and standard deviation indicated in your code. For instance, typing either

=NORM.DIST(105, 90, 8, TRUE) or


=NORM.DIST(1.88, 0, 1, TRUE)
produces a value of 97%. Remember that these two probabilities (shaded in the two pictures
above) are equivalent because of the standardization theorem. The complement rule can then be
applied to retrieve the 3% solution seen before.

To recap, the act of typing =NORM.DIST(z,0,1,TRUE) in an Excel spreadsheet is equivalent to
looking up a value of z in a cumulative standard normal table. Note, however, that Excel can
make the calculation for any normal random variable provided you type in the appropriate mean
and standard deviation. That is, Excel is “aware” of the standardization theorem and correctly
performs all necessary calculus. One last comment is in order before returning to the Salt Lake
City data: For continuous random variables, since there are uncountably many outcomes, one
does not have to worry about the difference in using “≤” vs. “<” or “≥” vs. “>”. Since we are
focused on areas under curves, the inclusion or omission of one single value does not alter the
desired area. Be aware – there is a difference between using “≤” vs. “<” or “≥” vs. “>” with
discrete random variables (such as the binomial) and one must be particularly careful about the
inclusion or exclusion of single values when working in discrete cases.
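As a cross-check on the Excel calculation above, the pine tree probability can also be computed
in Python with SciPy (again, an optional illustration rather than a course requirement). The two
lines below compute the same area in the two equivalent ways promised by the standardization
theorem.

# Optional Python/SciPy check of the pine tree example.
from scipy.stats import norm

p_direct = 1 - norm.cdf(105, loc=90, scale=8)   # P(X > 105) when X ~ N(90, 8)
p_standardized = 1 - norm.cdf((105 - 90) / 8)   # P(Z > 1.875) when Z ~ N(0, 1)
print(p_direct, p_standardized)                 # both approximately .03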

Approximation to the Sampling Distribution and P-value Calculation

We now know that the normal distribution can approximate a binomial distribution when the
number of trials (n) is large. To determine whether the data from Salt Lake City are suggestive
of the postponement theory, we were faced with a binomial distribution with n = 747 and
p = 0.25. The last step before we can calculate a p-value for our hypothesis test associated with
the Salt Lake City data is to determine which normal curve should be used to approximate our
binomial null distribution. We know that all normal random variables are categorized by their
mean and standard deviation, so the way we achieve the proper approximation is to pick the normal
curve that has the same mean and standard deviation as the binomial random variable in use.
This is a simple task and is based on the following binomial random variable facts:

Mean and Standard Deviation of a Binomial Random Variable: A binomial random variable
based on n trials and probability of success p has mean μ = np and standard deviation
σ = √(np(1 − p)).

Thus, a binomial random variable with n = 747 and p = 0.25 has mean μ = 186.75 and standard
deviation σ = 11.83. The normal random variable that approximates this binomial random
variable should have the same mean and standard deviation. Therefore, our approximate
sampling distribution for X, the total number of successes among the 747 trials, is
N(186.75, 11.83). Next, recall the definition of p-value:

p-value: The chance of observing the value of the statistic from your sample (or one more
extreme) if, in fact, the null hypothesis is true.

Our null and alternative hypotheses for Case Question 1B are

H0: p = .25 (the population proportion is 25%)
HA: p < .25 (the population proportion is less than 25%).

Additionally, the data collected in the Salt Lake City obituaries produced 60 successes. The
most extreme evidence for H A would be zero successes. If this had occurred, then no one in the
sample would have died within the three months prior to their birthday. Everyone in the sample
would have “postponed” death until after their birthday. So, “extreme” evidence for the
alternative hypothesis in this case are low values of X = ∑Xᵢ, the total number of successes. By
this logic, our p-value is

    P(X ≤ 60).

We know that the distribution of X is approximately N(186.75, 11.83) and so our approximate
p-value is (using standardization)

    P(X ≤ 60) = P( (X − 186.75)/11.83 ≤ (60 − 186.75)/11.83 )
              = P(Z ≤ −10.71).

It is very rare to observe a value of a standard normal random variable outside the bounds
−3 ≤ z ≤ 3. So, the z-statistic of −10.71 that appears in our p-value calculation is incredibly
uncommon. While the value of z = −10.71 is so rare that it is “off the charts” in normal tables
listed in textbooks, Excel gives the approximate p-value to be

=NORM.DIST(-10.71, 0, 1, TRUE) = 4.57 × 10⁻²⁷.
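The same approximation can be reproduced outside of Excel. The sketch below (Python with
SciPy, an optional illustration) repeats the normal-approximation calculation and also shows how
the exact binomial tail probability could be requested for comparison.

# Optional Python/SciPy check of the approximate p-value for Case 1B.
from math import sqrt
from scipy.stats import norm, binom

n, p = 747, 0.25
mean = n * p                  # 186.75
sd = sqrt(n * p * (1 - p))    # approximately 11.83

z = (60 - mean) / sd          # approximately -10.71
print(norm.cdf(z))            # normal-approximation p-value, roughly 4.6e-27

print(binom.cdf(60, n, p))    # exact binomial P(X <= 60), for comparison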

Case Question 1B Concluding Decision

Our calculated p-value is very, very small, indeed! So, if the null hypothesis in Case Study 1B
is true, our sample produced an astronomically rare event. What we observed in the Salt Lake
City data is so rare that we can no longer reasonably cling to the null hypothesis. If H0 is true,
then why would we see such an incredibly rare outcome? Remember, the smaller the p-value,
the more evidence for HA. We have a very, very small p-value in Case Study 1B. Therefore,
the evidence is strong that we should reject the null hypothesis and claim that the postponement
of death theory is suggested by the Salt Lake City data. That is, assuming that the Salt Lake City
data are representative of the targeted population, we can no longer reasonably claim p = 0.25.
Instead, under the assumption that our data are representative of the population, the evidence
collected points strongly to p < 0.25.

Concepts and Terms To Review from Case Question 1B:

Population; sample; random variable; parameter; statistic; population proportion; sample
proportion; null hypothesis; alternative hypothesis; lower tailed test; upper tailed test; sampling
distribution; binomial random variable; discrete random variable; continuous random variable;
normal random variable; density function; mean of a random variable; standard deviation of a
random variable; unimodal; standardizing; standard normal random variable; cumulative
probability; complement rule; mean and standard deviation of a binomial random variable;
p-value

Statistical Follow-Up to Case 1B: An Introduction to Confidence Intervals

In addressing Case Questions 1A and 1B we developed a statistical inference procedure called a
hypothesis test. When conducting a hypothesis test, we set up a null and alternative hypothesis.
The null hypothesis is the hypothesis that the data do not indicate a change from the “status quo”
or the manner in which things have occurred in nature to this point. The alternative hypothesis is
a hypothesis of “change” – it is often the hypothesis that the experimenter is hoping to
substantiate.

We assume that the null hypothesis is true as we progress through the testing procedure. We
decide on a test statistic, the value of which we can calculate from our observed data set. The
test statistic is some summary of the random variables central to the statistical problem and so
the test statistic has a probability distribution. This distribution describes the behavior of the
outcomes for the test statistic in repeated samples. The distribution of a test statistic when the
null hypothesis is assumed true is called the null distribution. From the null distribution, we
can use the definition of p-value to ascertain how likely or unlikely our sample results are if H0
is true. If the p-value is low, we have evidence for rejection of H0.

Aside from the hypothesis test procedure, another very popular statistical inference technique is
the creation of a confidence interval for the parameter of interest. The confidence interval
provides the experimenter with a reasonable “range of guesses” for the parameter based on the
data collected. Each confidence interval begins with the choice of a confidence coefficient. In
this section, we will choose this coefficient to be 95% (0.95). Reasons for this choice will be
discussed in Case Study #2, but recall that statisticians involved in statistical inference problems
generally don’t label p-values as rare until they fall in the lower or upper 5%-10% of the null
distribution. When this statement is applied to the concept of confidence intervals, it can be
interpreted as saying that we typically choose confidence coefficients to be around 90%-95%.
Another popular choice is 99%.

Let’s create a 95% confidence interval for a population proportion in the context of Case
Study 1B. Once we look specifically at the data from Case 1B, we will be in a position to
generalize the confidence interval procedure and develop a formula for a large sample
confidence interval for a population proportion.

In Case 1B, our test statistic was the total number of successes observed in the sample which was
denoted as X. The approximate null distribution of X was N(186.75, 11.83). The first step in
constructing a 95% confidence interval is to determine the 95% “most frequent” outcomes of the
test statistic, by looking for the central 95% portion of the null distribution. This is shaded in the
figure below. Notice that this leaves 2.5% of the area under the density curve in each tail of the
null distribution. Further, notice the null distribution is centered over the mean which is 186.75.

If the null hypothesis is true, and we were able to repeatedly sample the target population, then the
value of the test statistic X would fall in the shaded region 95% of the time. So, what are the
boundaries of this shaded region? This is a key question – one that is critical to discovering a
formula for our confidence interval.
While practicing how to make normal distribution calculations from a table or Excel, one
problem we could have encountered is to find P(−1.96 ≤ Z ≤ 1.96). The answer to this
probability question is 95%. Try it using Excel or a table.

This means that 95% of the time, the outcome of a standard normal random variable falls within
the interval (−1.96, 1.96). But, it has an even grander interpretation than this! Because of the
standardization theorem, the fact that P(−1.96 ≤ Z ≤ 1.96) = .95 means that 95% of the time, the
outcome of any normal random variable falls within 1.96 standard deviations of the mean. This
is very important to remember.

In developing the hypothesis test procedure for Case 1B, we approximated a binomial random
variable X with a normal random variable having the same mean and standard deviation. Since
the mean of a binomial random variable is np and the standard deviation is √(np(1 − p)), we can
standardize X in the following way:

    Z = (X − np) / √(np(1 − p)).

When n is large, this quantity is approximately standard normal. Because of this we can
substitute (X − np)/√(np(1 − p)) for Z in the expression P(−1.96 ≤ Z ≤ 1.96) = .95. This
substitution is the second step in arriving at an expression for a confidence interval for the
population proportion, p. When this substitution is made we obtain

    P( −1.96 ≤ (X − np)/√(np(1 − p)) ≤ 1.96 ) = .95

The inequalities inside the parenthesis can be algebraically rearranged. This algebraic
rearrangement will not change the truth of the probability statement; instead, it will simply make
the expression look different. Rearranging the above expression can produce

X p 1  p  X p 1  p  
P   1.96  p   1.96   .95 .
n n n n 
 
For the curious reader, the algebra required to obtain the second rearrangement from the first is:
multiply all terms by √(np(1 − p)), subtract X from each resulting term, divide all of what you
obtain by −n, and then reverse the inequality signs since you divided by a negative value.
Essentially what has been done is that the “p” in the “ np ” term from the numerator has been
algebraically isolated in the center of the inequality. Isolating parameters in inequalities like this
is the mathematical procedure by which confidence interval formulas can be determined. The
isolation of the parameter through the use of algebra is the third and final step in developing
confidence interval formulas.
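For readers who would like to see these three steps displayed rather than described, here is one
way to write them out (shown in LaTeX notation; the symbols are exactly the ones already defined
above):

\[ -1.96 \le \frac{X - np}{\sqrt{np(1-p)}} \le 1.96 \]
\[ \Rightarrow\; -1.96\sqrt{np(1-p)} \;\le\; X - np \;\le\; 1.96\sqrt{np(1-p)} \]
\[ \Rightarrow\; -X - 1.96\sqrt{np(1-p)} \;\le\; -np \;\le\; -X + 1.96\sqrt{np(1-p)} \]
\[ \Rightarrow\; \frac{X}{n} - 1.96\sqrt{\frac{p(1-p)}{n}} \;\le\; p \;\le\; \frac{X}{n} + 1.96\sqrt{\frac{p(1-p)}{n}}. \]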

Despite the fact that the value of “p” is isolated in the center of our inequality, it unfortunately
also appears in the endpoints. That is, “p” still appears in the term on the left and on the right in
the string of inequalities. This does not always happen when statisticians derive confidence
interval formulas. We will see in Case 2 that once the parameter is isolated, it does not appear in
the endpoints. However, when the parameter does appear in the endpoints, the typical resolution
is to replace the parameter by its estimating statistic in order to complete the formula.

We know that the population proportion p is estimated by the sample proportion p̂ = X/n. So,
replacing p by p̂ we finally reach the expression for a large sample confidence interval for a
population proportion:

Large Sample 95% Confidence Interval for a Population Proportion: Estimating a
population proportion, p, by collecting data on a large number of independent Bernoulli trials can
be done using the expression

    p̂ ± 1.96·√( p̂(1 − p̂) / n ).

Here, the sample proportion is given by p̂ = X/n, where X is the total number of successes
observed among the n Bernoulli trials.

When the formula above is applied to the Salt Lake City data set, we get

    60/747 ± 1.96·√( (60/747)·(687/747) / 747 ).

After some brief calculator work, we conclude that the 95% confidence interval for the
proportion of people in the greater Salt Lake City area that die within the three months prior to
their birthday is (.061, .100). Based on the sample of 747 Salt Lake City obituaries, we believe
that a reasonable range of guesses for the proportion of people that die within the three months
prior to their birthday is 6.1% to 10.0%. We say that we are “95% confident” that the value of p
is between .061 and .100.
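The interval is quick to reproduce with a few lines of code; the sketch below (Python, an optional
illustration) mirrors the formula term by term.

# Optional Python check of the 95% confidence interval for the Salt Lake City data.
from math import sqrt

x, n = 60, 747
p_hat = x / n                                    # sample proportion, about .080
margin = 1.96 * sqrt(p_hat * (1 - p_hat) / n)    # margin of error, about .020
print(p_hat - margin, p_hat + margin)            # approximately (.061, .100)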

Remember that p is a parameter. Thus, it is an unknown constant that we will never know
exactly unless we observed EVERY member of the population. For this reason, we do not speak
of a “95% probability of p being between two values”. Such a phrase is just as nonsensical as
saying that there is a “95% chance that 5 is between 6 and 7” or that there is a “95% chance that 5
is between 1 and 2”. Either the true, unknown value of p is or is not between .061 and .100.
Based on our sample, it is reasonable to estimate p to be between .061 and .100. Since p is a
constant, we refrain from using the word “probability” in reference to p. Thus, the necessity for
a new word – that word being “confident”. The “confidence” that we have actually isn’t in the
two numbers that comprise our confidence interval. Interpreting a confidence interval is never

about you or about the results that you personally obtained from your sample. Instead, the
confidence is in the mathematical procedure that has been outlined above.

To get a better handle on the interpretation of a confidence interval, imagine that 100 friends
each had collected a data set of the same size as that of the Salt Lake City data set. Then “95%
confident” means that if all 100 friends follow the procedure outlined above while using their
sample data, then we could expect 95% of our friends to generate a confidence interval that
would include the true, unknown value of p. This leaves 5% of our friends to have an interval
that would not include the true, unknown value of p. This is as far as the interpretation can go
because p is unknown (we are trying to estimate it). We won't know which friends trapped
p in their interval and which friends did not. However, despite not knowing who got it and who
didn’t – we are confident that 95% did!
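The “100 friends” interpretation can also be mimicked with a simulation. The sketch below
(Python with NumPy, an optional illustration) pretends the true proportion is known, simulates
many samples of size 747, builds the 95% interval from each one and counts how often the
interval traps the truth. The value 0.08 chosen for the “true” proportion is an arbitrary
demonstration value, not a claim about the real population.

# Optional simulation of confidence interval coverage.
import numpy as np

rng = np.random.default_rng(1)
true_p, n, reps = 0.08, 747, 10_000        # hypothetical truth, sample size, number of "friends"

x = rng.binomial(n, true_p, size=reps)     # simulated number of successes for each friend
p_hat = x / n
margin = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)
covered = (p_hat - margin <= true_p) & (true_p <= p_hat + margin)
print(covered.mean())                      # proportion of intervals that trap true_p; close to .95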

Concepts and Terms To Review from the Follow-Up to Case Question 1B:

Hypothesis test; test statistic; null distribution; p-value; confidence interval; confidence
coefficient; large sample confidence interval for a population proportion; interpretation of a
confidence interval

Case Study #2A: Rapid Aging in Children
Introduction

A very rare genetic disorder called HGPS (Hutchinson-Gilford Progeria Syndrome) produces
rapid aging in young children. Only about one in 8 million children is born with the
disorder. Geneticists and other scientists are very interested in studying HGPS in the hope that the
normal biological process of aging in humans can be better understood. No known treatment has
consistently proven effective in reversing the effects of HGPS. Many people with HGPS
develop heart disease before they are 20 years old. Pulse wave velocity (PWV) is a standard
measure of vascular stiffness and this variable is a very important factor in a person's heart
health. Do children with HGPS tend to have larger PWV values than those children without the
disease? If so, then the drug lonafarnib, an enzyme inhibitor, may be justified as part of the
treatment for children with HGPS. The drug is thought to be effective in lowering PWV values
and thus could prolong and increase the quality of life of a child with HGPS.

Case Question 2A

In 2012, a large research team was formed in Boston to study HGPS and the effect on PWV in
young children. They gathered data from 18 children with HGPS and measured PWV values
with the units being meters per second. The 18 observations are ranked and presented below.

7.2 7.9 8.3 8.3 9.1 9.3 10.1 12.4 12.9 12.9 13.1 13.7 14.1
14.8 16.0 17.5 17.6 18.8

Among children, a PWV value above 6.6 m/s is considered abnormal. How strong is the
evidence in the data regarding the claim that HGPS children have abnormal PWV values on
average?

A simple visual inspection of the data certainly gives the impression that the evidence is very
strong in favor of the conjecture that HGPS children have abnormal PWV values on average.
After all, every data point in the sample lies above 6.6. But, could the entire HGPS population
have a PWV average of 6.6 and it just so happen – by chance alone – that we could observe 18
children with PWV values above 6.6? It is possible this is what happened. But is it likely? Just
how unlikely is it? Further, is it so unlikely that the best decision to make based on the evidence
is that HGPS children have abnormally high PWV values on average?

We need a statistical procedure to either validate or refute what we are seeing in the data. HGPS
is a serious disorder and treating children with enzyme inhibiting drugs is something that
requires careful, scientific study. The quality of life of children is at the center of the decision
making in this case. We simply cannot “look” at the data and from that glance alone be ready to
scientifically recommend a drug for treatment. We need to quantify – with probability
calculations and sound statistical procedures – the body of evidence collected by the Boston team
of researchers.

Before going any further, imagine you and some friends are watching a basketball game and the
announcer tells you that the player currently about to shoot free throws can make that shot with
50% accuracy. It is possible that a free-throw shooter who, on average, makes 50% of his shots
can now sink 18 free throws in a row? Yes, it is possible. But, is it likely? The binomial
distribution of Case 1 can help us with the calculations, here. Assuming each free throw to be an
independent Bernoulli trial with probability of success, p  .5 allows us to model the number of
free throws made as a Binomial experiment. Out of n  18 trials, the chance we will observe
y  18 successes is

    [n! / (y!(n − y)!)] · p^y · (1 − p)^(n−y)  =  [18! / (18!·0!)] · (.5)^18 · (1 − .5)^0  ≈  .000004.

So, someone who on average makes 50% of their foul shots could make 18 straight, but our
binomial calculation tells us to expect this only 4 times out of every million.
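The free-throw probability is easy to verify with software as well; the sketch below (Python with
SciPy, an optional illustration) evaluates (.5)^18 directly and then asks the binomial mass
function for the same value.

# Optional check of the 18-straight-free-throws probability.
from scipy.stats import binom

print(0.5 ** 18)               # approximately 3.8e-06, i.e., roughly 4 chances in a million
print(binom.pmf(18, 18, 0.5))  # same probability from the binomial mass function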

When we think about this basketball example as a metaphor for the fact that all 18 PWV values in
the Boston data are above (rather than below) 6.6 m/s, it does indeed appear as though we are
discovering evidence based on probability – and not just visual inspection – that HGPS children
tend to have abnormal vascular stiffness.

Population and Sample

Despite the research team being located in Boston, the children in the study were from a variety of
locations all around the world. The children ranged in age from three to 16. Thus, the
population targeted for inference can reasonably be taken to be all children with HGPS, and the
sample consists of the 18 children who had PWV values measured for the study. Recall that
the goal of statistical inference is to use a sample which is representative of the population to
infer characteristics of the population at large. Here, we are hoping to statistically infer information
about vascular stiffness in HGPS children by using the sample collected by the Boston team.
The procedures developed below rest on the reasonable assumption that the children in the
sample constitute a random sample of HGPS children. Recall that a random sample of size n is
a sample that is just as likely to have been chosen as any other sample of the same size. While
the Boston research team could not enumerate the list of all children in the world with HGPS and
then select names at random, they did take precautions to select children that they felt were
reasonably representative of the population of kids with HGPS. Often, we must make every
effort to select a sample without biasing factors and then presume that our sample is a random
sample.

When sampling a population, consideration should be given to whether or not outside factors
exist that would cause the sample to misrepresent the target population. If these factors can be
removed, then one often assumes the collected sample is representative of the population under
study. At other times, however, the population has to be redefined to accommodate challenges in
sampling.

Random Variable, Parameter and Statistic of Interest

The variable being measured on the children is PWV in meters per second. This is a continuous
random variable that can theoretically take on any positive value. The larger the PWV value,
the stiffer the arterial wall and, consequently, the more at risk a person is for
cardiovascular problems. Look back at the question in Case 2A. We want to know if the data
support the claim that HGPS children have abnormal PWV values on average. Our investigation
is not trying to make a claim about a single child. Instead, our parameter of interest is the
average PWV value of HGPS children. This average could be over 6.6 m/s, even though some HGPS
children have values much smaller (or larger) than this. The current Case 2A question is a
statistical inference problem about a population mean. Often in statistical investigations we
use the words “mean” and “average” interchangeably. All of the statistical issues in Case 1
centered around a population proportion. Our data consisted of Bernoulli trials, not
measurements on a continuous random variable.

Since our parameter is a population mean (average), it makes logical sense that we would
estimate or approximate this population mean with the mean (average) of the collected sample.
When the data from a sample consists of just zeroes and ones (failures and successes), then the
basic way to summarize the data is to total up the number of successes or report a sample
proportion. This is what was done in Case 1. Now that the data consists of measurements on a
continuous variable, we need to look at additional statistics that can summarize the information
collected.

Descriptive Statistics For Measurements on a Single Quantitative Variable

Descriptive statistics is the branch of statistical science that attempts to summarize and describe
data by using graphical and numerical techniques. In contrast to statistical inference, descriptive
statistics seeks to “paint a picture” of the collected data rather than make data-driven claims
about the population. The root word of descriptive is describe and that is exactly what we are
trying to do – describe the features of the data. Describing data before formal statistical analysis
is important. Often, we can “see” features in the data from a figure or “discover” trends and
behaviors in the data by looking at a set of numbers or graphs. Blindly picking statistical
inference techniques without the use of descriptive statistics is scientifically dangerous and is
strongly discouraged. Statisticians often talk about “looking at the data” and what they mean is
that using descriptive statistics is a proper precursor to selecting the right inferential technique
for decision making.

One numerical descriptive statistical technique has already been alluded to: the sample mean.
Data that is naturally numeric is called quantitative. Quantitative data is often summarized with
measures of center and measures of spread. Measures of center attempt to summarize the
center-point or congregating point that the data gather around. Often, but not always, a measure
of center can describe where the bulk of the data are located. Measures of spread attempt to
numerically describe how tight or loose the data are clustered. Is the vast majority of the data
near one value, with very little dispersion in the values? Or, are the data spread out
loosely, arranging themselves in a more scattered and dispersed way? Measures of spread try to
give numerical answers to questions such as these.

For data collected on a quantitative variable, the most popular measure of center is the sample
mean and the most popular measure of spread is the sample standard deviation. We have seen
the words “mean” and “standard deviation” before in the context of binomial and normal random
variables. The binomial random variable is an example of a discrete random variable whereas
the normal is a continuous random variable. The binomial mass function and the normal density
function are models for populations. Here, we are introducing the mean and standard deviation
of a sample, not a population.

When we say that a random variable X has a normal distribution with mean 90 and standard
deviation 8, then we are making statements about a population of values. The normal
distribution is the mathematical model for all the values in the population and we believe that the
mean of all the values in the population is 90 and that the standard deviation that represents all of
the values in the population is 8. This isn’t the scenario that the HGPS data in Case 2A presents
to us. We have only 18 values from a larger population. Our parameter of interest is the
population mean PWV value, but we can only calculate the average from our sample. When we
speak of the sample mean we are simply using the average of the data we have collected in the
sample.

Sample mean – the average of the measurements on a quantitative variable collected in the
sample. We calculate the sample mean by taking the total of all the observations on our variable
and dividing by the sample size. If the variable we are studying is denoted X, then the sample
mean is denoted X̄. If the sample of measurements collected is denoted x₁, x₂, …, xₙ, then the
calculated value of the sample mean is x̄ = (∑xᵢ)/n.

We use the notation X when abstractly speaking about a random variable. As such, we often
write phrases such as X = the PWV value on a child with HGPS. We use the lowercase notation
x to represent the numerical outcome of the variable. For instance, the first observation in our
data set is x = 7.2. Likewise, we use the capital X̄ when speaking about the sample mean as a
random variable, yet unobserved. However, once the value of the sample mean is known, we use
the lowercase x̄.

For instance, for the data in Case 2A, x̄ = (∑xᵢ)/n = 224/18 = 12.44. That is, the average PWV
value in the sample of 18 children is 12.44 m/s. This is seemingly quite a bit larger than the
value of 6.6 m/s which is of interest in the case study. Soon, we will evaluate whether the
difference in these two values is statistically significant to the level of making the claim that
HGPS children have abnormally high PWV values on average. In order to make this decision,
we will also need the sample standard deviation.

Sample standard deviation – the square root of the sum of the squared deviations from the
sample average divided by one less than the sample size. Here, a deviation is the difference
xᵢ − x̄, where xᵢ denotes the ith observation of the variable in the data set, i = 1, 2, …, n. The
sample standard deviation is denoted as S. The calculated value of S is denoted by s. The
sample standard deviation is calculated using the formula

    s = √( ∑(xᵢ − x̄)² / (n − 1) ).

The value underneath the square root is called the sample variance and is denoted s².

The sample standard deviation is interpreted as the typical distance that a data point falls away
from the sample average. The larger the value of s, the more spread out the data are around x̄.
Small values of s indicate a tight clustering of the data points around x̄. One suggestion for
keeping track of the calculations necessary in obtaining s is to create a chart like the following
one for the HGPS data.

Data Point (xᵢ)     Deviation (xᵢ − x̄)     Squared Deviation (xᵢ − x̄)²
7.2 -5.24 27.4576


7.9 -4.54 20.6116
8.3 -4.14 17.1396
8.3 -4.14 17.1396
9.1 -3.34 11.1556
9.3 -3.14 9.8596
10.1 -2.34 5.4756
12.4 -0.04 0.0016
12.9 0.46 0.2116
12.9 0.46 0.2116
13.1 0.66 0.4356
13.7 1.26 1.5876
14.1 1.66 2.7556
14.8 2.36 5.5696
16.0 3.56 12.6736
17.5 5.06 25.6036
17.6 5.16 26.6256
18.8 6.36 40.4496

Totaling up the last column gives ∑(xᵢ − x̄)² = 224.9648. The last step in the calculation is to
divide the total of the squared deviations by n − 1 = 17 and then take the square root. This leads
to the sample variance being s² = 13.23 and s = 3.64. So, the typical distance that a PWV value
in the sample tends to differ from the average of x̄ = 12.44 is by the amount 3.64.

Of course, the potentially tedious calculations involved in finding the sample mean and sample
standard deviation can be expedited with statistical software or programs such as Excel. If the

18 PWV values in the sample are typed into cells A1 through A18 (down one column) in an
Excel spreadsheet, then clicking in any other cell on the sheet, the user can implement the
following Excel functions to calculate the sample mean and standard deviation:

=AVERAGE(A1:A18)
=STDEV.S(A1:A18)

For calculations by hand, it is often helpful to use the computational (short-cut) formula for s,
which is

    s = √( [ ∑xᵢ² − (∑xᵢ)²/n ] / (n − 1) ).

The reason why this is considered a short cut is because usually the value of ∑xᵢ is known
coming into the calculation. The sample mean is typically calculated before the sample standard
deviation, so the data will have already been totaled. The term ∑xᵢ² requires squaring all of the
original data points and then summing these squared values. The reader can check that for the
PWV values in Case 2A we have (as expected):

    s = √( [ 3012.52 − (224)²/18 ] / 17 ) = √13.2332 = 3.64.
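The same summaries can be verified outside of Excel as well. The sketch below (Python with
NumPy, an optional illustration) reproduces the sample mean, sample variance and sample
standard deviation; ddof=1 requests the n − 1 divisor used above.

# Optional NumPy check of the Case 2A summary statistics.
import numpy as np

pwv = np.array([7.2, 7.9, 8.3, 8.3, 9.1, 9.3, 10.1, 12.4, 12.9,
                12.9, 13.1, 13.7, 14.1, 14.8, 16.0, 17.5, 17.6, 18.8])

print(pwv.mean())         # sample mean, about 12.44
print(pwv.var(ddof=1))    # sample variance, about 13.23
print(pwv.std(ddof=1))    # sample standard deviation, about 3.64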

Aside from the sample mean as the main measure of center for a quantitative data set, we can
also calculate the sample median. The sample median is the value in the middle of the ordered
list of the data. If the sample size is even, then we average the two values that are in the middle
of the ordered list. This is the case for the PWV data in Case 2A, where the sample median is the
average of the two values 12.9 and 12.9, which is, of course, 12.9. The sample median
designates the value for which half of the data falls above the value while the other half of the
data falls below. Discussions of situations where it is advantageous to use the sample median
over the sample mean when measuring center will occur in the future.

Aside from the sample standard deviation, another measure of spread in a data set is the sample
range. The range is simply the difference between the largest and smallest values in the data set.
The sample range is 11.6 for the PWV data. The sample range suffers from being unduly
influenced by outliers at the extreme ends of the data. The term “outlier” should be used
generally to describe any data point that falls substantially out of the general pattern established
by the bulk of the data. At times, values that are extremely large or small when compared with
the remainder of the data can be identified. These values should be investigated to see if there is
an assignable cause for why they are so extreme. Outliers can also be seen visually in graphical
descriptive statistics techniques.

It should be noted that x̄ (and even s) can be affected by the presence of outliers. Statisticians
have developed other (albeit less popular) measures of center and spread that can be used to be
more representative of center and spread in the presence of outliers. These statistics will not be
discussed here (other than the median), but can be found in many popular statistics texts. The

presence of outliers is one case in which the sample median is often reported. The sample
median is robust with respect to outliers – meaning that it is not influenced by extreme
observations. After all, it reports only the value in the middle of an ordered list and does not
even factor in the most extreme data points.

In addition to numerical measures that describe the features of quantitative data, it is of utmost
importance to be able to interpret graphical descriptions of the sample. The most common
graphical display of a quantitative variable is a histogram. Construction of a histogram begins by
grouping the data into bins that typically are of equal width. Then, either the frequency or, more
often, the relative frequency of the data falling within each bin is determined. The relative
frequency of a bin reflects the fraction of the sample that falls within the bin's boundaries. A
histogram then plots these bins and relative frequencies creating a graphical display of the data.

One should not get caught up with a prescription or elaborate set of rules while trying to
construct histograms. The most important thing to remember is that the histogram is a summary
of the data and from it we should be able to describe the main features in the data set. Almost
always, there is more than one way to accurately construct a histogram for a set of data.
Implementation of statistical software is encouraged when creating histograms since these
computer programs use algorithms that deduce a number of bins appropriate for the data set.
This is often the key dilemma that one faces when constructing a histogram by hand: How many
bins do I need and how wide should they be? There is no definitive answer to such a question.

Generally speaking, the larger the data set, the more bins can be used to accurately portray the
features of the data. That being said, many histograms utilize in the neighborhood of 4-12 bins
depending on sample size. Creation of a histogram is as much an artistic process as it is
scientific. To outline the general strategy, consider again the data of Case 2A. A data set with
only 18 observations is on the low end in terms of sample size that can lead to an interpretable
histogram. Very small data sets may produce histograms that are not informative. Creating
histograms for data sets with extremely small sample sizes (say, less than 10) should be avoided.
In those cases, a scan for extreme outliers and a few numerical measures will suffice. After all,
extremely small data sets don’t need much summary.

The data from Case 2A could be quickly broken up into six bins of length two: [7, 9), [9, 11),
[11, 13), [13, 15), [15, 17) and [17, 19). This is certainly not the only set of bins for the data that
would be reasonable. However, it is a simple and neat division of the range of values seen in the
data that will suffice as bins. Respectively, the relative frequencies of the data observed in these
six bins are: 4/18, 3/18, 3/18, 4/18, 1/18, and 3/18. The histogram using these six bins is plotted
via computer at the top of the following page. Notice that the midpoints of each cell are labeled
on the horizontal axis and the vertical axis displays the actual cell frequencies rather than relative
frequencies. On the horizontal axis, the boundaries of the cells themselves could have been
indicated (7, 9, 11, 13, 15, 17, 19). As long as these labels are clear, these individual plotting
decisions are far from the main point. In fact, plotting frequencies on the vertical axis as opposed
to relative frequencies does not alter the shape of the histogram, which is the primary issue.
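For readers working outside of Excel or JMP, the histogram just described can be reproduced with
a few lines of code. The sketch below (Python with Matplotlib, an optional illustration) uses the
same six bins of width two.

# Optional Matplotlib version of the first PWV histogram.
import matplotlib.pyplot as plt

pwv = [7.2, 7.9, 8.3, 8.3, 9.1, 9.3, 10.1, 12.4, 12.9,
       12.9, 13.1, 13.7, 14.1, 14.8, 16.0, 17.5, 17.6, 18.8]

bins = [7, 9, 11, 13, 15, 17, 19]            # the six bins of width two
plt.hist(pwv, bins=bins, edgecolor="black")  # frequencies on the vertical axis
plt.xlabel("PWV (m/s)")
plt.ylabel("Frequency")
plt.show()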

For the most part, this histogram contains bars that are similar, or uniform, in height. There are
no bars isolated by themselves. This is due to the lack of outliers. Notice that if one were to
“balance” a histogram for a quantitative variable, it would balance at x̄. This is confirmed in
our histogram for the PWV data since we would visually judge the graph's balancing point to
be somewhere within the bin from 11 to 13. Indeed, recall x̄ = 12.44.

Another histogram for the PWV data is provided at the top of the following page. This time, a
cell width of three was used with only five bins. In this plot, the cell boundaries appear and the
relative frequency (as a decimal) is labeled as “probability” on the vertical axis. These two plots
show that for small data sets, seemingly minor changes in decisions that create histograms can
lead to different appearances. As the sample size grows, small changes in the number of bins
and their width will have little impact on the final shape of the histogram. The second histogram
has a more mound-shaped and unimodal look. A normal density curve is overlaid in this plot
purely for accent and comparison. Again, we see no evidence for outliers since there are not
isolated bars or sets of bars on the extreme left or right side of the plot. Depending on how the
data are grouped, we might conclude that the data either look approximately uniform or
approximately normal or even possibly some hybridization of the two shapes. What is important
for us is that there is no radical indication of outliers or skewness.

The term skewed is often used in statistics as an antonym to symmetry. Data are skewed if the
frequencies or relative frequencies “tail off” to the right or the left. Data that are skewed can be
identified by a general trend of stair stepped histogram bars either on the left or right hand side of
the plot.

Data that are skewed do not necessarily have histograms that continually step down to the right
or left as the following figures show. The data can have a mode that is interior to the plot and
then begin stepping down to the right or left. Data such as these are still referred to as skewed
right or skewed left because they show a unique “tail” of the data’s distribution on either the
right or the left.

While there are many shapes of histograms that could be encountered, it is important to be able
to identify the features of the data using the terminology given in this section. For an additional
illustration, consider the following three histograms that are roughly symmetric. In the first
picture, the data are basically uniform whereas the second histogram is essentially normal
(symmetric and unimodal). The third histogram is an example of a data set that could be
described as symmetric and bimodal.

The statistical software JMP can easily calculate numerical summary statistics and create
histograms. Data in JMP are entered in a “Data Table”, which is analogous to a spreadsheet in
Excel. Once all of the PWV data are entered in a single column in the data table, the user can
click on the Analyze>Distribution menu and then indicate the variables that they would like to
explore. The default JMP output for creating descriptive statistics includes a histogram as well
as set of summary values. The mean and standard deviation are included in this default display
of summary statistics. Clicking on the red triangles among the JMP output will allow the user to
explore various options and customize the output.

The default JMP output for the descriptive statistics on the PWV variable from Case 2A are
displayed below. Through the aforementioned options, the histogram can be displayed in a more
traditional horizontal fashion. Notice that the default JMP histogram uses seven bars of width 2
– yet another choice, different from the two histograms previously explored. Despite this, the rough
features of symmetry and lack of outliers are again present.

Distributions
PWV

Quantiles
100.0% maximum 18.8
99.5% 18.8
97.5% 18.8
90.0% 17.72
75.0% quartile 15.1
50.0% median 12.9
25.0% quartile 8.9
10.0% 7.83
2.5% 7.2
0.5% 7.2
0.0% minimum 7.2

Summary Statistics
Mean 12.444444
Std Dev 3.6377469
Std Err Mean 0.8574252
Upper 95% Mean 14.253453
Lower 95% Mean 10.635435
N 18

The pth percentile, or quantile, is the value at or below which p% of the data fall. JMP
uses a statistical algorithm to interpolate these values when they do not fall exactly on one of the
data points in the sample. For instance, JMP’s algorithm estimates that the 25th percentile of the
PWV data is 8.9. Using JMP’s algorithm, our interpretation is that 25% of the data in the sample
fall at or below the value 8.9. The 50th percentile is another name for the sample median and
the 25th and 75th percentiles of a sample are termed the first quartile and third quartile of the
data. Earlier, we alluded to some of the other measures of center and spread that statisticians use
to summarize features of data. Some of these numerical measures make use of quartiles.
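Percentiles can also be computed outside of JMP, although different programs use slightly
different interpolation algorithms and so will not always match JMP's values exactly. A short
sketch in Python with NumPy (an optional illustration):

# Optional NumPy percentile calculation for the PWV data.
import numpy as np

pwv = np.array([7.2, 7.9, 8.3, 8.3, 9.1, 9.3, 10.1, 12.4, 12.9,
                12.9, 13.1, 13.7, 14.1, 14.8, 16.0, 17.5, 17.6, 18.8])

# First quartile, median and third quartile. NumPy's default interpolation rule
# differs from JMP's, so small differences from the JMP output are expected.
print(np.percentile(pwv, [25, 50, 75]))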

Adjacent to the above histogram is another graphic called a boxplot. The boundaries for the
large rectangular boxes in the plot are determined by the quartiles. The line segment in the
middle of the plot dividing the large rectangle into two smaller rectangles is the sample median.
The so called “whiskers” in the plot vary in length depending upon the particular algorithm being
used. Often, a formula involving quartiles is used to determine the whisker length. The relative
sizes of the two rectangles and the set of whiskers can give a hint at the symmetry or lack thereof
in the data set. When the two rectangles are roughly the same size and when the lengths of the
two whiskers are approximately the same, we have evidence for symmetry. If one box is much
larger than the other – and correspondingly, one whisker is much longer than the other – then this is
evidence for skew.

Hypotheses To Be Tested

The issue in Case 2A is investigation of the claim that HGPS children have abnormal PWV
values on average. Our parameter of interest is the average PWV value of HGPS children. The
value of a population mean is often denoted by the symbol μ. Thus, we will let μ be the
average PWV value of HGPS children in the population of interest. The value of μ is estimated
by the sample mean, x̄. Briefly, we review the features of the sample that have been uncovered
through the use of descriptive statistics.

• The sample mean is x̄ = 12.44 and the sample standard deviation is s = 3.64, both of
which are based on a sample size of n = 18.
• The sample size is small and so a strong claim about the shape of the data is difficult to
make. There is some evidence that the population sampled is either roughly uniform or
normal, but there are no noticeable outliers or extreme skew indicated in histograms.

A PWV value above 6.6 m/s is considered abnormal. We need to investigate the evidence in the
data regarding the claim that HGPS children have abnormal PWV values on average. We begin
our statistical inference procedure by requiring the data to establish this claim. Thus, our null
and alternative hypotheses are

H0: μ = 6.6
HA: μ > 6.6.

If we do not reject the null hypothesis, then we will not have sufficient evidence to refute the
claim that the average PWV value of HGPS children is a healthy value of 6.6. If the null hypothesis is
rejected, we will have found evidence that HGPS children tend to have abnormal PWV
values on average.

Test Statistic and Sampling Distribution

In order to know whether or not x̄ = 12.44 is a statistically significant result that is indicative of
HA, we must ask ourselves two very important questions:

• What other values of X̄ could we have obtained in other potential samples?
• How likely are these values of X̄ if the null hypothesis is true?

We have seen these two questions in other contexts before. In order to conduct a hypothesis test
for a population parameter, we need a test statistic and a null distribution. Answering the two
questions above generates both of these entities.

Since the parameter being tested is μ and the statistic which we are using to estimate μ is the
sample mean, X̄, it is worth investigating X̄ as a potential test statistic. We now arrive at the
key question for Case 2A: What is the distribution of X̄ if the null hypothesis concerning
μ is true? The answer to this question gives rise to one of the most fundamental ideas in the
history of statistical science.

To know the exact distribution of X̄ when H0 is true requires that we know another distribution
first: the distribution of PWV values in the entire population. This poses an obstacle since we
do not know the distribution of PWV values among all HGPS children. If we did, then the
question in Case 2A would be irrelevant to explore since it would already have been answered.
We know only what the sample has provided us. Histograms and summary statistics hint at the
possibility that the population could be roughly symmetric with a mean near 12.44 and standard
deviation near 3.64. But, make sure you understand: the histogram was for the data collected in
the sample. The value of 12.44 refers to the average of the sample and the standard deviation of
3.64 was calculated from the sample. The best (only?) reflection of the population is what we
have seen among the 18 PWV values in the sample.

Thus, without firm knowledge of the distribution for the population of PWV values, we are in a
position to make an assumption. This is not a rare position to be during a statistical
investigation. For that matter, it is not a rare position be in during a scientific investigation.
Mathematics provides an aid in our current case through the following theorem.

Theorem: Suppose a random sample of size n is collected on a variable. If the population of
possible outcomes for this variable has a normal distribution with mean μ and standard
deviation σ, then the sample mean, X̄, also has a normal distribution with mean μ, but with
standard deviation σ/√n.

Despite this theorem having meaningful mathematical implications in the theory of statistical
science, it is not an overly helpful result in the pursuit of resolving Case 2A. We need to
understand why it is not of direct help and then discuss a result similar to the theorem which is of
utmost importance to our problem. Before moving on, notice that the theorem tells us that
averages of normal random variables have smaller standard deviations than the original random
variables that comprise the averages (σ/√n < σ). This is true when averaging random
variables in general. Succinctly said, averaging reduces variability. This is an important concept
and it will be seen again while exploring Case 2B.
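The claim that averaging reduces variability can be seen quickly in a simulation. The sketch
below (Python with NumPy, an optional illustration) draws many samples of size n = 18 from a
normal population and compares the standard deviation of the simulated sample means with
σ/√n; the population mean and standard deviation used here are arbitrary demonstration values.

# Optional simulation: the standard deviation of sample means is close to sigma / sqrt(n).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 90, 8, 18, 10_000           # arbitrary demo population and sample size

samples = rng.normal(mu, sigma, size=(reps, n))  # each row is one sample of size n
xbars = samples.mean(axis=1)                     # one sample mean per simulated sample

print(xbars.std(ddof=1))      # should be close to the theoretical value below
print(sigma / np.sqrt(n))     # sigma / sqrt(n), about 1.89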

From the study of normal random variables we know that standardizing involves subtracting the
mean and dividing by the standard deviation. Applying this to X̄ we obtain the expression

    Z = (X̄ − μ) / (σ/√n).

This is called the standardization formula for the sample mean and, combined with the
previous theorem, we now know that if we are sampling from a normal population, this
random variable has a standard normal distribution. You should pause and recognize at this
point - if we are sampling from a population that is not normal, then we are unsure about the
distribution of Z – the theorem doesn’t pertain to non-normal populations.

The reason that the standardization formula for the sample mean is not of direct help is because
σ is not known. We do not know the standard deviation of PWV values in the entire HGPS
population. We do know that the standard deviation of the 18 PWV values in our sample is
s = 3.64. With σ unknown, the above formula is of no help in calculations since a numerical
value of Z cannot be determined. The sample mean, X̄, can be calculated (12.44). The value of
μ has been assumed by the null hypothesis (6.6). The sample size is known (18). However, σ
is unknown.

We have seen that statistics estimate parameters. This can surely be done now. Despite not
knowing σ exactly, we can estimate it with s = 3.64. Thus, it seems logical to replace σ with s
and attempt to carry on with a resolution of Case 2A. A statistic that we can calculate would be

    (X̄ − μ) / (s/√n).

As is so often the case in science, solving one problem leads to another. Despite having created
a statistic that we can calculate, the previous theorem doesn't address the use of s at all. The
sample standard deviation appears nowhere in the theorem statement and so we have no
guarantee that this new quantity still has a standard normal distribution.

Now, for many years scientists and statisticians believed that the quantity did still behave like a
standard normal random variable when σ is replaced with s in the theorem. But, they were
wrong. In 1906, a man
named William Gossett laid the groundwork for the conclusion that

    (X̄ − μ) / (s/√n)

does not have a standard normal distribution when we have collected a random sample from a
normal population. Instead, Gossett argued that a new distribution was needed in order to
correctly address questions like the one in Case 2A. Gossett was a student of a famous
statistician named Karl Pearson. Because of this, through time, the word “student” has been
associated with this new distribution that Gossett discovered.

Theorem: Suppose a random sample of size n is collected on a variable. Suppose the population
of possible outcomes for this variable has a normal distribution with mean μ and standard
deviation σ. If σ is unknown and replaced by the sample standard deviation s, then the quantity

    t = (X̄ − μ) / (s/√n)

has a t distribution with n − 1 degrees of freedom.

This theorem provides us with a test statistic and null distribution for the resolution of the
hypothesis test in Case Question 2A. When dividing by a quantity that involves the sample
standard deviation (s) rather than the population standard deviation (σ), we use the phrase
studentizing, rather than standardizing. Thus, the quantity in the theorem is the studentization
formula for the sample mean. It is more common to just refer to the formula as the "one
sample t-statistic".

The theorem above is a major step towards calculating a p-value in order to resolve Case 2A.
However, realize that in order to use either theorem – and we desire to use the latter one – we
must assume that the data in our sample came from a normally distributed population. Our
theorems only address the situation where the distribution of the population is normal. We do
have some evidence that this assumption could be made for the PWV values. The evidence for
symmetry is mildly compelling, but overall, our strongest evidence is that there is not an
indication of extreme outliers or obvious skew. In the absence of worrisome outliers and skew,
the assumption can be entertained. We will see later that for sample sizes as small as ours, the
use of the t distribution poses little problem when descriptive statistics indicate a lack of extreme
skew.

Thus, in order to test


H0: μ = 6.6
HA: μ > 6.6

we will assume that the sample of n = 18 PWV values represents a random sample from a normally distributed population (at least approximately symmetric without evidence for skew and multiple modes). Since our parameter of interest is the population mean μ, we will estimate it with the statistic X. Once we studentize X, we have a test statistic which is

t = (X − μ) / (s/√n)

and the distribution of this statistic when the null hypothesis is true is t with n − 1 degrees of
freedom.

Characteristics of t Random Variables

The density function for a t random variable shares a lot of similarities with that of a standard normal random variable. This, in part, is one reason why it took until 1906 for someone like Gosset to make his discovery. To the naked eye, the density curves can appear interchangeable.

 The density function for a t random variable is always unimodal, bell-shaped and
symmetric around zero.
 The outcomes from a t random variable are always more dispersed than outcomes from a
standard normal random variable. This is graphically shown by the “tails” of the t
density function being “fatter” than those of the standard normal density function. See
the pictures below.
 As the degrees of freedom of a t random variable increase, the t random variable has a
density function that behaves more and more like that of a standard normal density curve
(statisticians call this convergence in distribution)

The following picture illustrates the similarities and differences between t density curves and a
standard normal density curve.
[Figure: a standard normal density curve plotted together with t density curves having 3, 8, and 20 degrees of freedom]

There are four density curves plotted here and at this scale, they are only mildly distinguishable.
The curve which hits highest on the vertical axis is the standard normal curve. The curve that
hits lowest on the vertical axis is the t density with three degrees of freedom. Between these two

curves the t density with 8 (second lowest hit on the vertical axis) and with 20 (second highest
hit) degrees of freedom are also plotted. Notice that at approximately x = ±1.75 all of these four curves "meet up", but then the tails of the distributions differ beyond these boundaries.
To see the differences between these four density curves better, it would be wise to examine just
the tail areas. The plot below focuses on just the right-hand tail of the four density curves.

From this picture it is clear that the curve with the “fattest” tail – that is, having most area in the
tail – is the t density curve with 3 degrees of freedom (df). The t density with 8 df has less area
in the tail than does its counterpart with only 3 df. Finally, the t density with 20 df is still “fatter”
than the standard normal density tail (Z), but only slightly so. Imagine the tails of the t densities
for which df > 20. All of these curves would lie between the curve marked df = 20 and Z. This
reinforces the fact that t densities with this many degrees of freedom are very similar to the
standard normal density.
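
To make this comparison concrete, the tail areas can be computed directly. The short sketch below is optional and uses Python with the scipy library (our own choice of tool and of the cutoff value 2, neither of which is part of this manual); it prints P(T > 2) for t random variables with 3, 8, and 20 degrees of freedom alongside P(Z > 2) for a standard normal random variable.

# Optional sketch: right-tail areas of t densities versus the standard normal density
from scipy import stats

cutoff = 2.0  # an arbitrary right-tail cutoff chosen for illustration

for df in (3, 8, 20):
    # stats.t.sf(x, df) returns P(T > x) for a t random variable with df degrees of freedom
    print(f"P(t_{df} > {cutoff}) = {stats.t.sf(cutoff, df):.4f}")

# stats.norm.sf(x) returns P(Z > x) for a standard normal random variable
print(f"P(Z > {cutoff})   = {stats.norm.sf(cutoff):.4f}")

The printed areas shrink as the degrees of freedom grow and approach the standard normal tail area, which is exactly the "fatter tails" behavior described above.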

An Additional Consideration In Hypothesis Tests: Type I and II Errors

From prior statistical investigations, recall that the lower the p-value, the more evidence the data
contains for the alternative hypothesis. In Case 1B, the calculated p-value was so close to zero,
that without much consideration it was decided that this was “small enough” to reject H 0 . But,
in general, the p-value is compared to a significance level (or α-level) in order to determine whether rejection is warranted or not. The significance level is a threshold for evidence. When the p-value falls below it, we state that the information seen in the sample is statistically significantly different from what would be expected to have been seen under H0. Thus, when the p-value is less than α, H0 is rejected. If the p-value is equal to or above α, we fail to reject H0.

A significance level is chosen in advance of data collection by the investigator. It could be that
different people invested in the problem would individually choose different significance levels.

However, for the purpose of scientific communication or published research, one value is
generally agreed upon.

The significance level should be based on the relative consequences of what are termed Type I
and Type II errors.

Type I error: Making the decision to reject the null hypothesis when, in fact, the null
hypothesis is true.

Type II error: Making the decision to not reject the null hypothesis when, in fact, the
alternative hypothesis is true.

Significance Level (α) - the probability of committing a Type I error.

Because the null distribution gives the distribution of the test statistic when H 0 is true, the
probability of a Type I error can be definitively calculated. On the other hand, when H 0 is false,
we do not know the exact distribution of a test statistic. This distribution depends upon the
degree to which H0 is false – a quantity we cannot ascertain. Thus, the value of α can be
chosen based upon the consequence of making a Type I error, but the chance of a Type II error
cannot be calculated using the null distribution.

If the consequences of committing a Type I error are high, then the significance level should be
set low in order to protect against the possibility of making a faulty statistical decision. Despite
the difficulty of directly calculating the chance of a Type II error, it can be indirectly controlled.
It can be shown that, in general, as the Type I error probability increases, the Type II error
probability decreases and vice versa.

Through many different applications, scientists have determined that choosing an α-level at or below 10% is appropriate in the vast majority (but not all) of statistical inference problems. When α-levels are chosen higher than 10%, researchers are of the general opinion that rejection of the null hypothesis has become too easy and the scientific standard for statistical
significance eroded. It should not be overly easy to reject H 0 . We do not want to artificially
reject null hypotheses on the basis of little evidence.

As a rough guide, investigators who feel the consequences of making a Type I error are particularly severe typically choose α = 1% (α = .01). On the other hand, when Type II errors are considered most dire, experimenters often choose α = 10% (α = .10). In the cases that both errors are seen as equally serious, or there is little way to know the effect of the errors, then a general compromise is struck and the significance level is chosen to be near 5%. This last scenario tends to be particularly popular. Many hypothesis tests are based on the choice of α = .05.
While this is often the standard value of choice, it is important to remember that fear of one
particular type of error over another should be enough to move one off this default level.

Case Question 2A Concluding Decision

Since we desire to test

H0: μ = 6.6 (HGPS children have normal PWV values on average)

HA: μ > 6.6 (HGPS children have abnormal PWV values on average)

a Type I error is the claim that HGPS children have abnormal PWV values when, in fact, they do not. If this error were made, then HGPS children might be routinely treated for a heart condition unnecessarily, or expensive protocols might be developed for HGPS children that are wasteful.

A Type II error in Case 2A would consist of failing to reject H0 and thus claiming that HGPS children have normal PWV values on average when, in fact, these children tend to have abnormal vascular stiffness. This error might cause caretakers to assume children don't have a condition
vascular stiffness. This error might cause caretakers to assume children don’t have a condition
when in reality, they do. Making a Type II error could lead to the absence of a protocol – and
lack of treatment – when, in fact, children with this condition should be treated or preventative
action taken.

Clearly, researchers would like to avoid making either of these errors. But, we should
acknowledge that in all hypothesis tests, the chance of making one exists. We have only a
sample and not the entire population at our disposal. Statistics is the science of making decisions
in the face of uncertainty using relevant data. At all times in statistical investigations we are
working from knowledge of part of the population. This always keeps some level of uncertainty
in play. The issue is the degree of that uncertainty and thankfully, we have the p-value to
measure that degree.

Assessing the severity of the two errors, an experimenter should choose an α-level prior to data
collection. Henceforth, we proceed believing that this was done before our statistical
development commenced.

p-value: The chance of observing the value of the statistic from your sample (or one more
extreme) if, in fact, the null hypothesis is true.

The value of the statistic observed from our sample is

t = (X − μ)/(s/√n) = (12.44 − 6.6)/(3.64/√18) = 6.81.

Since the alternative hypothesis is one-sided and upper tailed, the p-value for our test is the
chance that we observe an outcome from a t distribution with n − 1 = 17 degrees of freedom to be greater than or equal to 6.81. We represent this with the symbols:

p-value = P(t17 ≥ 6.81).

For all continuous random variables, probabilities such as this one are found by obtaining areas
under the relevant density curve. For normal density functions, we have seen how this can be

done using statistical tables or functions in software such as Excel. The procedure is no different
for t random variables. We will use either a t-table or Excel commands to facilitate the
calculation. Software is particularly recommended in this case, since t-tables can only give
bounds on the p-value. Use of t-tables shows that our p-value is less than .005.

Typically, t-tables provide “cutoffs” required for certain right-tail areas under the density curve.
Left-tail cutoffs are just the negative of the right-tail cutoff since all t density curves are
symmetric around zero. From t-tables, the cutoff for a p-value equal to .005 in the right tail of
the t17 distribution is 2.898. Our test statistic value of 6.81 is far larger than this. This is really all that
is needed in order to make our statistical decision. With a p-value lower than .005, which is ½ of
1%, it is clear that the p-value will be lower than whatever significance level was set (1%?, 5%?,
10%?). So, rejection of H 0 is warranted. The data provide sufficient statistical evidence to
conclude that HGPS children have abnormal PWV values on average.

The Excel command required to calculate our p-value is


=T.DIST.RT(6.81, 17)

When this code is typed in a cell, the result is p-value = P(t17 ≥ 6.81) ≈ .00000152. So, to repeat, this is lower than the pre-set significance level and so the data suggest rejecting H0.

In general, use the following code for t distribution Excel calculations:

Desired Area                                              Code to Paste in Cell

Area in Left Tail (left of x)                             =T.DIST(x, df, TRUE)
Area in Right Tail (right of x)                           =T.DIST.RT(x, df)
Area in Both Tails (to left of –x AND to right of x)      =T.DIST.2T(x, df)
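
For readers who prefer a scripting environment to Excel, the same calculations can be reproduced in Python using the scipy library (an optional illustration; scipy is our choice and is not required by this course). The sketch below recomputes the Case 2A test statistic and p-value and shows analogues of the three Excel commands in the table above.

# Optional sketch: reproducing the Case 2A calculation outside of Excel
from scipy import stats

# Summary statistics reported in this manual for the PWV sample
xbar, mu0, s, n = 12.44, 6.6, 3.64, 18
df = n - 1

# Studentized test statistic: t = (xbar - mu0) / (s / sqrt(n)), about 6.81
t_stat = (xbar - mu0) / (s / n ** 0.5)

# Upper-tailed p-value, the analogue of =T.DIST.RT(t_stat, 17)
p_value = stats.t.sf(t_stat, df)
print(t_stat, p_value)

# Analogues of the three Excel commands above, evaluated at a generic point x
x = 6.81
left_tail = stats.t.cdf(x, df)              # =T.DIST(x, df, TRUE)
right_tail = stats.t.sf(x, df)              # =T.DIST.RT(x, df)
both_tails = 2 * stats.t.sf(abs(x), df)     # =T.DIST.2T(x, df)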

Originally, a simple visual inspection of the data gave the impression that the evidence is strong
in favor of the conjecture that HGPS children have abnormal PWV values on average. We
needed statistical evidence to make this claim scientifically sound. Through the use of
descriptive statistics, reasonable assumptions, the introduction to the t-distribution as a sampling
distribution, and calculation of the appropriate p-value, we have provided sufficient statistical
evidence backing up our conclusion. In addition, we saw how Type I and II errors, along with
significance levels, have a place in the hypothesis testing procedure. The test that has been
developed in conjunction with Case Question 2A is called the one-sample t-test for a population
mean.

Confidence Interval For a Population Mean Based on the One-Sample t-test

With the term significance level defined earlier, we can now connect this term to confidence
coefficients associated with confidence intervals for parameters. Typically, if an investigator
would have chosen the significance level of a test to be α, then the corresponding confidence coefficient is 1 − α. This link is mathematical in origin. The boundaries associated with values
of the test statistic that would lead to the retention of H 0 are directly used in constructing a
confidence interval for a parameter. Said another way, if (by definition) the chance of making a
Type I error is α, then by the complement rule, the chance of deciding to retain H0 when the null hypothesis is true is 1 − α. This mathematical fact is useful for obtaining confidence interval
formulas.

This correspondence in place, it is certainly true that analysts who desire a confidence interval
for a parameter do not always consider Type I and Type II error repercussions. Instead, they
understand that all other quantities held constant, a confidence interval grows in width as the
confidence coefficient increases.

In addition, with few exceptions, confidence intervals for parameters are two-sided. They have
both a lower and upper bound. To this point, we have formally considered only one-sided
hypothesis tests. Scanning back over previous case study questions, the alternative hypothesis
considered have either been upper tailed or lower tailed. Two sided alternatives for hypothesis
tests are considered in Case 2B. A 95% confidence interval for a parameter is mathematically
linked to a two-sided test for that parameter using   .05 . Again, this link will be explored in
Case 2B.

We begin our confidence interval development for a single population mean the same way we
did while examining proportions in Case 1B. First, we can write a probability statement based
on the following picture.

For the PWV data set from Case 2A, we can imagine this curve representing the density of a t17
random variable. Using t-tables or Excel, it can be deduced that with α = .05,

P(−2.11 ≤ t17 ≤ 2.11) = .95.

We know that if the random variables X1, ..., X18 are normally distributed, then the test statistic

t = (X − μ) / (s/√n)

has t-distribution with 17 degrees of freedom. Inserting this expression for t into the probability
statement, we can now write

P( −2.11 ≤ (X − μ)/(s/√n) ≤ 2.11 ) = .95.

As we saw with the development for proportions in Case 1B, the inequalities inside the
parenthesis can be algebraically rearranged. This algebraic rearrangement will not change the
truth of the probability statement; instead, it will simply make the expression look different.
Rearranging the above expression can produce

P( X − 2.11(s/√n) ≤ μ ≤ X + 2.11(s/√n) ) = .95.
The endpoints of this rearranged expression give us the formula for the 95% confidence interval
for the population mean PWV value among HGPS children. Our 95% confidence interval in this
case is

X ± 2.11(s/√n).

Of course, from earlier study we know that x = 12.44, s = 3.64, and n = 18 for the PWV sample. Inserting these values gives the 95% confidence interval for the population mean to be (10.63, 14.25). Based on the sample of 18 PWV values, we believe that a reasonable range of guesses for the average PWV value in the HGPS population is 10.63 m/s to 14.25 m/s. We say that we are "95% confident" that the value of μ is between 10.63 and 14.25.
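
The interval just reported can be checked with a short calculation. The sketch below is optional and again uses Python with scipy (our own choice of tool); it recovers the multiplier 2.11 from the t distribution with 17 degrees of freedom and then forms the interval from the summary statistics.

# Optional sketch: the 95% t-based confidence interval for the mean PWV value
from scipy import stats

xbar, s, n = 12.44, 3.64, 18
df = n - 1

t_cut = stats.t.ppf(0.975, df)      # cutoff placing area .025 in each tail, about 2.11
margin = t_cut * s / n ** 0.5
print(round(t_cut, 2), round(xbar - margin, 2), round(xbar + margin, 2))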

Remember that μ is a parameter. Thus, it is an unknown constant that we will never know exactly unless we observed EVERY member of the population. For this reason, we do not speak of a "95% probability of μ being between two values". Such a phrase is just as nonsensical as saying that there is a "95% chance that 5 is between 6 and 7" or that there is a "95% chance that 5 is between 1 and 2". Either the true, unknown value of μ is or is not between 10.63 and 14.25. Based on our sample, it is reasonable to estimate μ to be between 10.63 and 14.25. Since μ is a constant, we refrain from using the word "probability" in reference to μ. The "confidence" that we have actually isn't in the two numbers that comprise our confidence interval. Interpreting a confidence interval is never about you or about the results that you personally obtained from your sample. Instead, the confidence is in the mathematical procedure that has been outlined above.

To interpret the confidence interval, imagine that 100 friends each had collected a data set of the same size as that of the PWV sample. Then "95% confident" means that if all 100 friends follow the procedure outlined above while using their sample data, then we could expect 95% of our friends to generate a confidence interval that would include the true, unknown value of μ. This leaves 5% of our friends with an interval that would not include the true, unknown value of μ. This is as far as the interpretation can go. Since μ is unknown (we are trying to estimate it), we won't know which friends trapped μ in their interval and which friends did not.

However, despite not knowing who got it and who didn’t – we are confident and expect that 95%
did.

The entire development above can be repeated with a curve representing the density of a t random variable with an arbitrary number of degrees of freedom, say, df. The values "2.11" and "−2.11" are replaced with the values that secure an area of α/2 in each tail. Generally, these values are denoted t(α/2, df) and −t(α/2, df). Repeating the creation of a probability statement based on the picture, the inserting of the t-statistic, and the isolation of μ algebraically gives the following.

Suppose a random sample of size n is collected on a variable. Suppose the population of possible outcomes for this variable has a normal distribution with mean μ and standard deviation σ. If σ is unknown and replaced by the sample standard deviation s, then the 100(1 − α)% confidence interval for μ is given by

X ± t(α/2, df) (s/√n).

Here, the degrees of freedom are df = n − 1.
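
The result above translates directly into a small reusable function. The sketch below is a hypothetical helper (the name t_confidence_interval and its arguments are our own, not part of the manual) that accepts the sample summaries and a confidence coefficient and returns the endpoints of the interval.

# Optional sketch: a general t-based confidence interval for one population mean
from scipy import stats

def t_confidence_interval(xbar, s, n, conf=0.95):
    # Assumes the sample comes from a normally distributed population
    alpha = 1 - conf
    df = n - 1
    t_cut = stats.t.ppf(1 - alpha / 2, df)   # the value t(alpha/2, df)
    margin = t_cut * s / n ** 0.5
    return xbar - margin, xbar + margin

# Example: the Case 2A PWV interval, which should be about (10.63, 14.25)
print(t_confidence_interval(12.44, 3.64, 18))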

Concepts and Terms To Review from Case Question 2A:


Population
Sample
Random sample
Random variable
Parameter
Statistic
Descriptive statistics
Measure of center
Measure of spread
Sample mean
Sample standard deviation
Sample median
Outlier
Robust statistic
Histogram
Bin frequency
Bin relative frequency
Uniformly distributed population
Normally distributed population
Skewness
Skewed right
Skewed left
Bimodal
Symmetric distribution
Percentile
First quartile
Third quartile
Boxplot and whiskers
Null hypothesis
Alternative hypothesis
Test statistic
Null distribution
Standardization of X
Studentization of X
t distributed random variable
Characteristics of t random variables
Significance (α) level
Type I error
Type II error
p-value
One sample t-test for a population mean
Confidence interval
Confidence coefficient
Upper tailed test
Lower tailed test
95% confidence interval for a population mean
Interpretation of a confidence interval

Case Study #2B: GPA of Students in Introductory Statistics
Introduction

Is there any reason to believe that the GPA's of students taking statistics are any different from those of the university as a whole? Could there be some factor or set of factors that draws students with grade point averages lower or higher on average than the general population into introductory stats classes? Not everyone takes statistics in college, but among those who do, are their GPA's different from the average GPA of all students? Implicitly, it is commonly believed on most campuses that some courses are easier than others. This almost has to be true by definition. However, are the GPA's of students required to take introductory statistics really any different from the average overall?

Case Question 2B

A statistics instructor used an online databank to record the GPA’s of the 90 students that had
enrolled in his introductory statistics class over the last two years (2013-2014). The GPA’s were
collected at the time the student began the course and the average of the 90 GPA’s was 2.733
with a standard deviation of .762. The most recent report (2012) in university literature states
that the average GPA of all university students having completed at least one semester in
residence at the university is 2.889. Is the GPA of students enrolling in introductory statistics statistically significantly different from that of the university as a whole?

A set of descriptive statistics is provided below concerning the sample of 90 students. The
histogram and summary statistics were calculated in JMP.

Histogram of 90 GPA's in the Instructor's Classes

Quantiles
100.0% maximum 4
99.5% 4
97.5% 3.89608
90.0% 3.6935
75.0% quartile 3.356
50.0% median 2.7425
25.0% quartile 2.198
10.0% 1.9094
2.5% 0.75858
0.5% 0.36
0.0% minimum 0.36

Summary Statistics
Mean 2.7326667
Std Dev 0.762033
N 90

Population and Sample

Care is required when defining the target population in Case 2B. The question is asking whether
or not those enrolled in statistics courses have a GPA that can be considered equivalent to the
entire university. This does NOT mean that the population of interest is the entire university.
Ask yourself: What group is being sampled here? The group being sampled is those students
who enroll in an introductory statistics course. Therefore, our target population is the set of all
students enrolling in introductory statistics at the university.

The sample is comprised of 90 students that we have GPA information for over the last two
years. Several reasonable questions surface when considering the population and sample for
Case Question 2B. First, the 90 students all had the same instructor. Is it believable that these
90 students represent a random sample of all students enrolling in the course over the period of
study? On average, would we believe that students with differing GPA's would tend to enroll
with specific instructors? This sounds like an entirely different study altogether – and it could be
interesting to explore.

However, the reasons that students enroll in particular sections are many: time of course, time at
which they have other courses, work schedules, friends’ schedules, and just plain apathy toward
the sections. Some students may hand pick an instructor, but other students may not. Despite
this, do we have any direct evidence that this would create a GPA imbalance across instructors?
While the question of whether or not the 90 students constitute a random sample of students
taking introductory statistics is interesting, we continue under the assumption that the sample is
representative of the population.

Random Variable, Parameter and Statistic of Interest

Each student in the population can be paired up – or matched with – their GPA. This is the act of
defining a random variable. Our random variable has been observed 90 times resulting in the 90
observations which are represented in the histogram above. So, our random variable of interest

is the GPA of an introductory statistics student at the university. A student’s GPA is computed
based on course grades that correspond to grade points. These grade points are then weighed by
the number of hours of credit the course is assigned. GPA is a variable that – strictly speaking –
takes on only a finite number of possibilities. However, because GPA is usually reported
utilizing several decimal points and conceptually could span virtually any value between 0.0 and
4.0, it makes sense to treat this random variable as continuous.

Quite often, a variable that technically can only take on a (large) finite number of values, is
analyzed as though the variable could result in uncountably many outcomes. This is done out of
convenience in our case. It does not appear to be a good use of time to try and consider “all” the
possible GPA values and then treat the variable as though it is discrete. Contrast the variable
“number of courses taken in a semester” with the GPA variable. The first of these is a variable
for which it is trivially easy to list the possible values. The second is not. Hence, from a
modeling perspective, we treat GPA as a continuous random variable that can result in any value
between 0.0 and 4.0.

Case Question 2B, like Question 2A, revolves around the average value from a population. In this case, we are concerned with the average GPA in the population. Thus, our parameter is the average GPA among students at the university who enroll in an introductory statistics class. This value is unknown. Many more than 90 students enrolled in the course across the period of time in which the data were taken. Only a subset of the population has actually been observed. In this subset, we know the sample mean GPA. It is 2.733.

Let μ = the mean GPA of students at the university that enroll in an introductory statistics course. This parameter (the population mean) is estimated by our statistic of interest: the sample mean of the 90 students from the instructor's courses. The sample mean is represented by X, and we have observed the value of the sample mean to be x = 2.733.

Hypothesis to Be Tested

In Case Study 2A, we developed the one-sample t-test for a population mean. Case Question 2B is also an inference problem concerning a population mean. The focal question is whether or not μ can be inferred to be different from the value of 2.889, which represents the average GPA for all students at the university. Notice that the focal question is not concerned with a specific direction. That is, we are interested in whether or not μ is different from 2.889, not whether the population mean is specifically larger or smaller than 2.889. This is an example of a two-sided alternative hypothesis. When considering two-sided hypotheses, the experimenter is focused on finding a difference from a hypothesized value, but is not interested in testing for a specific direction. Thus, rather than focusing solely on whether the data suggest that μ > 2.889 or μ < 2.889, we are interested in whether or not there is sufficient evidence to reject μ = 2.889 and instead conclude μ ≠ 2.889.

The statement that μ > 2.889 and the statement that μ < 2.889 are one-sided hypotheses. The statement that μ ≠ 2.889 is two-sided. Therefore, for Case Question 2B, we desire to test the following null and alternative hypotheses:

H0: μ = 2.889 (the average GPA among students taking introductory statistics is 2.889)
HA: μ ≠ 2.889 (the average GPA among students taking introductory statistics is NOT 2.889)

Aside from the fact that the alternative hypothesis is two sided, Case 2B shares a fundamental
similarity with Case 2A. Both cases wish to investigate a single population mean of a random
variable. Because of this, it is natural to inquire as to whether the t-test from Case 2A would be
an appropriate statistical technique to use for the solution of Case Question 2B.

Study of Case 2A provided us with a theorem that tells us how to studentize X . This
studentization yields the test statistic

(X − μ) / (s/√n).

However, the theorem only prescribes to us the distribution of this quantity if the population we are sampling is normally distributed. This is quite important. Very carefully review the "if" and the "then" portions of the theorems in Case 2A. IF X1, ..., Xn are normal random variables, THEN (X − μ)/(s/√n) has a t distribution with n − 1 degrees of freedom. What is the distribution of (X − μ)/(s/√n) if X1, ..., Xn are NOT normal random variables? The answer to

this is “we do not know”. No theorem presented thus far has addressed that case.

When analyzing the data from Case 2A, we assumed that the 18 observations came from a
normally distributed population. There was some evidence for this assumption, but due
primarily to the low sample size, it was not overwhelming. Histograms looked roughly
symmetric and there were no extreme outliers to destroy the assumption. We progressed largely
on the belief that there was no demonstrative evidence against normality more so than on direct
evidence that we were sampling a normal population. Case 2B is different from Case 2A on all
of these points.

While still focused on a population mean, Case Question 2B is associated with a sample size of
90. We have five times as much data in Case 2B as opposed to Case 2A. Secondly, the
histogram of the sample is clear. The data are skewed to the left. The histogram has many stair-
stepping bars to the left of the mode. The left-hand whisker in the box and whisker plot is much
longer than the whisker on the right. JMP even indicates a possible outlying observation far to
the left of the whisker in the plot. The sample mean is less than the sample median. All of these
pieces of information – taken alone – are not definitively convincing. However, all of this
information viewed as a collective creates a clear picture of GPA’s in the sample. The GPA’s in
the sample are NOT symmetric and therefore it does NOT appear reasonable to assume that the
population being sampled is normally distributed.

In Case 2A, we could at least imagine that a larger data set could have smoothed the bars of
resulting histograms to the point where a symmetric, unimodal curve would emerge. Here, we
need not imagine. We have the data to make this assessment. The ninety data points in the
sample point to a skewed population rather than a symmetric one. Thus, our previous theorems
do not apply.
Despite evidence for a skewed population, it still seems reasonable to attempt to use the test
statistic

(X − μ) / (s/√n).

After all, we know the value of X. The standard deviation of the sample was calculated to be s = 0.762 and n = 90. It is important to recall that a hypothesis test requires BOTH a test statistic and a null distribution in order for statistical inference to be made. At this point, we have retained our reasonable test statistic, but because the random variable being measured has a skewed distribution, we currently do not know the distribution of

(X − μ) / (s/√n).

The Benefit of a Large Sample

One obvious benefit of having a sample size of 90 as opposed to 18 is that we have more
information by which to make our decision. A random sample of size 90 contains more
information than a random sample of size 18. Be careful, however: a sample taken in a biased way isn't necessarily better just because it is based on a larger sample size. First and foremost, the collected sample should be representative of the target population. Once that is established, increasing sample size provides more power by which to make statistical decisions.

Another benefit of taking a large sample is that the value of the sample standard deviation has a greater chance of being near the true population standard deviation. Statisticians call this convergence in probability. As n grows, s has a greater and greater chance of being near σ. Written succinctly, for large samples, s ≈ σ. This is not necessarily true when dealing with small samples, but effectively, it solves part of our problem in the presence of a large sample. The two theorems presented in Case 2A differ on the point of whether or not σ is known. This point is rendered relatively moot when we have taken a large sample. In these cases, we may replace σ with s and incur no substantial mathematical "penalty". Then again, this still doesn't provide us with a null distribution for our test, because the theorems only pertain to normal random variables – regardless of whether we use s or σ in the test statistic.

What do we know about the distribution of X and about the characteristics of t random
variables? Let's chain some facts together:

 If X1, ..., Xn are normally distributed with mean μ and standard deviation σ, then

Z = (X − μ) / (σ/√n)

is standard normally distributed.
 If the population standard deviation (σ) is unknown and replaced by the sample standard deviation (s), then if X1, ..., Xn are normally distributed, we know that

(X − μ) / (s/√n)

is t distributed with n − 1 degrees of freedom.
 As the degrees of freedom of a t random variable increase, the t random variable has a density function that behaves more and more like that of a standard normal density curve. So, when n is large, then if X1, ..., Xn are normally distributed, we have that

(X − μ) / (s/√n)

is approximately standard normally distributed.

The sum total of the above three bullets is that if X1, ..., Xn are normally distributed and n is large, then the studentized statistic (X − μ)/(s/√n) is at least approximately standard normal. The big question….the really,
really big question is:

Big Question: What is the distribution of X if the random variables we observe are not
necessarily normally distributed?

We must answer this question in order to finish the analysis of Case Question 2B. Our
population is skewed. At present, we do not even approximately know the null distribution for
our test statistic. Thus, we are stuck. Until we obtain a null distribution, we cannot obtain a p-
value or construct a confidence interval that would help answer Case Question 2B.

The Central Limit Theorem

To investigate our “Big Question”, consider three different populations all of which are
definitively non-normal. The first is skewed right while the second is uniform. The third
population has two modes and is not symmetric. If we knew that we had collected a random sample from any of these three populations, then the theorems presented while studying Case
2A would not apply. None of these three populations is normally distributed.

Remember that in order to conduct a hypothesis test or formulate a confidence interval we need
to ask ourselves two questions:

 What other values of X could we have obtained in other potential samples?

 How likely are these values of X (if the null hypothesis is true)?

[Figure: density curves for the three example populations: Skewed Right, Uniform, and Bimodal/Non-Symmetric]

One way to investigate the answers to these two questions for the three populations represented
above is to simulate observations from the populations and study the behavior of X in the
samples that result from the simulation. Statisticians have discovered mathematical ways to
create fictitious data sets that mimic all of the features of real data sets that arise in applications.
These simulated data sets allow us to understand the overall behavior of samples that may, in
fact, emerge in real investigations. The simulated data sets are a scientific tool only. A tool that
illuminates mathematical properties so that we will be better equipped to understand the real data
sets that are collected in case studies such as the one we are exploring.

To get a feel for the answers to the two bulleted questions above, we can simulate (using a
computer algorithm) random samples from each of the populations. Then, once we have a
simulated random sample, we can compute the sample average. If we repeat this many times
(which is what a computer is great for!), then we can make a histogram of all the X values that
we obtained. This histogram will be an approximation to the actual sampling distribution of X .
If we simulate a large number of random samples, then this approximation will be quite accurate.
In this way, we can get a handle on the answer to the “Big Question” for our three non-normal
populations. Maybe the answer to the “Big Question” will be different for all three populations.
Then again, our simulation may reveal a trend that is common in all of the cases.

For each of the three populations (skewed right, uniform, bimodal/non-symmetric) 10,000
random samples were simulated and for each of these samples, X was calculated. The resulting
10,000 values of X were then placed in a histogram. This entire process was done for samples
of size 4, 10, and 50. The resulting approximate sampling distributions represented in the
histograms are presented on the following page.
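
A simulation of this kind is straightforward to carry out. The sketch below is an illustration only: it uses Python with the numpy library, an exponential population as one example of a skewed right population, and 10,000 replications, choices that mirror but do not reproduce the exact populations used for the figures that follow. It draws repeated samples, stores the sample means, and prints the center and spread that a histogram of those means would display.

# Optional sketch: simulating the sampling distribution of the sample mean
import numpy as np

rng = np.random.default_rng(seed=1)   # seed chosen arbitrarily so the run is repeatable
reps = 10_000

for n in (4, 10, 50):
    # One skewed right population: an exponential distribution with mean 1
    samples = rng.exponential(scale=1.0, size=(reps, n))
    xbars = samples.mean(axis=1)      # 10,000 simulated values of the sample mean
    # The mean of the xbars should sit near 1 and their spread should shrink as n grows
    print(n, round(xbars.mean(), 3), round(xbars.std(), 3))

A histogram of the simulated xbars (made in JMP, Excel, or any plotting tool) looks more and more bell-shaped as n increases, which is precisely the behavior summarized in the figures below.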

The sampling distributions for X based on samples where n = 4 show a fair amount of difference. In the case that the population is skewed right, the distribution of X is again skewed to the right. The other two cases reveal histograms that are roughly symmetric – more so for the uniform population than the bimodal/non-symmetric one. Nevertheless, the distribution of X appears to have different shapes when the sample size is just four. The same conclusion appears true when n = 10. However, it is important to notice that when the population is skewed right, the distribution of X when n = 10 is less skewed than in the case that n = 4.

In the case that the population is uniform or bimodal/non-symmetric, the sampling distribution for X when n = 10 is essentially symmetric and unimodal. One might look at these histograms and state that the sampling distribution for X when n = 10 is roughly normal. Finally, look at the column of histograms when n = 50. These three pictures have a more similar shape than either of the previous two columns based on smaller sample sizes. When the population is skewed right, the distribution of X when n = 50 has virtually all of its skew removed. At this point, it would be reasonable to state that X is approximately normal. This statement is even more definitive when the populations are uniform or bimodal/non-symmetric. In these two cases, when n = 50, the distribution of X is very well approximated by a normal curve. The histograms in these two cases are almost exactly symmetric and unimodal while being bell-shaped.

The general conclusion from these histograms is that no matter the shape of the original
population, the sampling distribution for X becomes more and more shaped like a normal
distribution as the sample size increases. This is regarded as the most important phenomenon in
all of probability theory. It is known as the Central Limit Theorem. The Central Limit Theorem
is the backbone of virtually all elementary statistical inference problems where the experimenter
has collected a large sample. The Central Limit Theorem (CLT) will be stated more formally
later. For now, it is summarized below.

Central Limit Theorem (Summary): No matter the features of the population being sampled, the
sampling distribution of the statistic X is approximately normal when the sample size is large.
As the sample size increases, this approximation becomes better and better and the standard
deviation among X values becomes less and less.

Look at the horizontal axes for each row of pictures. The sampling distribution has smaller and
smaller spread as the sample size increases in each case. For instance, in the case that the
population is bimodal and non-symmetric, the bulk of the distribution for X when n = 4 falls between 1 and 5. When n = 10, the bulk of the distribution falls between 1.5 and 4. Finally,

Sampling Distributions for X From Skewed Right Population

Sampling Distributions for X From Uniform Population

Sampling Distributions for X From Bimodal, Non-Symmetric Population

when n = 50, the distribution for X lies almost entirely between 2 and 3.5. This is visual
confirmation that averaging reduces variability.

Recall our previous theorem from Case 2A regarding sampling from a normal distribution with
the population standard deviation known:

Theorem: Suppose a random sample of size n is collected on a variable. If the population of possible outcomes for this variable has a normal distribution with mean μ and standard deviation σ, then the sample mean, X, also has a normal distribution with mean μ, but with standard deviation σ/√n.

The final statement in this theorem is actually true for all populations. That is, in general, the mean of X is μ and the standard deviation of X is reduced to σ/√n. The fact that the mean of X is μ in general can be seen in the plots above. The mean for the skewed right population is 1. For the uniform population, it can be shown to be ½. Finally, in the bimodal and non-symmetric population, the population mean was set to be 2.8. Now, if we visually investigate where the sampling distributions for X "balance", we will find that the balancing point for the distributions in the first row of plots on the previous page is at 1. Likewise, the histograms in the middle row all visually balance at ½ and those in the final row balance at 2.8. Finally, recall that no matter the shape of the original population, the sampling distribution for X becomes more and more shaped like a normal distribution as the sample size increases. Putting all of this together allows for a formal statement of the Central Limit Theorem.

Central Limit Theorem (Formal): Suppose a random sample of size n is collected on a variable. If the population of possible outcomes for this variable has mean μ and standard deviation σ, then, provided n is large, X has approximately a normal distribution. The mean of X is μ, and the standard deviation of X is σ/√n.

The reader should very carefully compare the CLT with the Theorem stated right above it.
Notice that when the population we are sampling from is normal, then X has (exactly) a normal
distribution as well. But, provided the sample size is large, X will have an approximate normal
distribution no matter the shape of the original population. In all cases, the mean of X is μ, and the standard deviation of X is σ/√n.

There is one scenario in which the distribution of X is not even approximately normal. We can
see from the plots above that this situation is when the population we are sampling from is
skewed and the sample size is small. The combination of skewed data and small sample sizes
must be handled with care. The distribution of X could be far from normal in these cases.
Analysis of data falling into this scenario is generally handled in advanced statistics courses as
opposed to a first course.

Test Statistic and Sampling Distribution

The “Big Question” has been answered with the Central Limit Theorem. If the random variables
we observe are not normally distributed, then the distribution of X is approximately normal
when the sample size is large. The histograms generated from the simulation experiment show
us that "large sample" is a relative term, but in Case Question 2B, our sample size is 90. All of the sampling distributions for X that have been explored when n = 50 appear almost exactly normally distributed. So, we have little concern about applying the CLT to our case where n = 90. Thus, in order to test

H0: μ = 2.889 (the average GPA among students taking introductory statistics is 2.889)
HA: μ ≠ 2.889 (the average GPA among students taking introductory statistics is NOT 2.889)

we will invoke the CLT with n = 90, which allows us to state that the statistic

(X − μ) / (σ/√n)

is approximately standard normal. In addition, it has already been argued that when n is large, s ≈ σ. Therefore, the test statistic

(X − μ) / (s/√n)

can be used along with a standard normal null distribution. The p-value that results from using
this null distribution will be approximate, but this approximation will be very good since the
sample size is large.

Case Question 2B Concluding Decision

Using the data from Case Question 2B, the value of the calculated test statistic is

(X − μ)/(s/√n) = (2.733 − 2.889)/(.762/√90) = −1.94.

To compute the p-value of the test, we need to use the standard normal distribution.

p-value: The chance of observing the value of the statistic from your sample (or one more
extreme) if, in fact, the null hypothesis is true.

If the alternative hypothesis had been HA: μ < 2.889, then our p-value would be P(Z ≤ −1.94). Likewise, using the definition of p-value, if we had HA: μ > 2.889, then the p-value calculation would be P(Z ≥ −1.94). Our alternative hypothesis is two-sided in Case 2B, and so the p-value must reflect the area in two tails of the null distribution. To make the proper p-value calculations associated with a two-sided alternative, we determine which tail the test statistic fell into and then find the one-sided p-value using that tail. Finally, this area is multiplied by two to obtain the final p-value.

For us, the test statistic fell in the lower tail of the null distribution (z = −1.94). Thus, the appropriate p-value is

p-value = 2·P(Z ≤ −1.94).

Using Excel or tables of the standard normal distribution tells us that our p-value is
approximately .052.
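
The z statistic and two-sided p-value reported here are easy to verify. The sketch below is an optional check using Python with scipy rather than Excel or tables.

# Optional sketch: the Case 2B two-sided p-value
from scipy import stats

xbar, mu0, s, n = 2.733, 2.889, 0.762, 90

z = (xbar - mu0) / (s / n ** 0.5)          # about -1.94
p_two_sided = 2 * stats.norm.sf(abs(z))    # area in both tails, about .052
print(round(z, 2), round(p_two_sided, 3))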

The discussion from Case 2A showed us that the null hypothesis should be rejected only when the p-value is sufficiently low – lower than a pre-set significance level (α). The value of α should be chosen to reflect the experimenter's concern over the impact of committing a Type I error.

Type I error: Making the decision to reject the null hypothesis when, in fact, the null
hypothesis is true.

Type II error: Making the decision to not reject the null hypothesis when, in fact, the
alternative hypothesis is true.

Significance Level (α) - the probability of committing a Type I error.

Since the p-value in Case 2B is approximately .052, the experimenter's decision could be entirely dependent upon conversations about the consequences of making Type I and Type II errors held prior to collecting the data. If α had been set to .10 (reflecting more of a concern for a Type II error), then the data suggest rejecting the null hypothesis. If α had been set to .01 (reflecting more of a concern for a Type I error), then the data are not suggestive of rejecting H0.

Recall, different people tend to make different decisions in real life based on different fears and
concerns of being “wrong” in particular ways. For instance, one person may avoid riding roller
coasters due to a fear of heights, getting ill, or the coaster breaking. Another person may
gravitate toward riding them over the concern of missing out on the fun and thrill of the
opportunity. Different people will look at the exact same situation with varying concerns. So it
is with hypothesis testing. This is the power of the p-value – it allows experimenters to make
decisions with the personal evaluation of consequences of making an error built in to the process.

But, what if the experimenter had set the significance level to be .05 in advance of data collection? What are we to do with a p-value that is so close to α? On one hand, technically the p-value is larger than α, so we are trained to refrain from rejecting H0 in this case. On the other hand, we must realize the null distribution is approximately normal and so the p-value is an approximation as well. It is a good approximation, but still, should we treat an approximate p-value of .052 differently than an approximate p-value equal to .048? This seems unwise.

Statisticians have differing opinions about how to handle these borderline cases. One possible
solution is to return to the population and collect additional data and repeat the analysis. If this
can be done, then possibly the p-value based on even more data will lead to a more firm
conclusion. However, obtaining more data is not always a simple solution. Data can be costly.
Collecting data can be time consuming. There are many (most?) real-world situations where
experimenters cannot simply “collect more data”. That said, in the scenario of Case 2B it might
be entirely possible to do this. The instructor might go one year back in time or re-run the
analysis after he has taught a few more classes. This is at least a possibility, but in other
situations the data set used for a first analysis is all that the experimenter could afford or have
time to collect.

Many analysts will refrain from making definitive conclusions about hypothesis tests when the p-value is very close to α. This can be frustrating to investigators that find themselves in this position. But, one should realize that when data are collected, we are not guaranteed that the weight of evidence for H0 will be starkly different from the weight of evidence for HA. The fate driving the balance of the scales between H0 and HA rests entirely with the data itself. If the data tend to create such balance, the experimenter is cautioned against artificially pushing down on one side or the other. Instead, it can be informative, albeit possibly disappointing, for the investigator to suspend definitive judgment.

Thus, our decision as to whether or not the GPA of students enrolling in introductory statistics is statistically significantly different from that of the university as a whole is driven by the pre-set α-level. In the case that this significance level is chosen near the p-value of .052, investigations concerning the null and alternative hypotheses may be stated as inconclusive.

Confidence Interval For a Population Mean Based on the Z Null Distribution

Creating a confidence interval for the population mean when the sample size is large is a matter
of invoking the Central Limit Theorem. The mathematics behind the development of the t-based
confidence interval from Case 2A are similar to our current problem. However, since the null
distribution used in the test for one population mean based on a large sample size is
(approximately) standard normal, this change must be reflected in the formula for the confidence
interval. Recall that use of the t-based confidence interval requires the assumption of sampling a
normally distributed population. This previous result generating the t-based interval is repeated
now.

Suppose a random sample of size n is collected on a variable. Suppose the population of possible outcomes for this variable has a normal distribution with mean μ and standard deviation σ. If σ is unknown and replaced by the sample standard deviation s, then the 100(1 − α)% confidence interval for μ is given by

X ± t(α/2, df) (s/√n).

Here, the degrees of freedom are df = n − 1.

The Central Limit Theorem applies no matter the shape of the original population. Thus, when
desiring a large sample confidence interval for a single population mean, we can make
adjustments to the above formula and proceed similar to what was done in Case 2A.

Suppose a random sample of size n is collected on a variable and that n is large. Further, suppose the population of possible outcomes for this variable has mean μ and standard deviation σ. If σ is unknown and replaced by the sample standard deviation s, then the (approximate) 100(1 − α)% confidence interval for μ is given by

X ± z(α/2) (s/√n).

Here, z(α/2) is that value which places area equal to α/2 in the right tail of the standard normal distribution.

For reference, z(α/2) = 1.96 if α = .05, z(α/2) = 2.575 if α = .01, and z(α/2) = 1.645 if α = .10.

Once again applying the CLT to the data from Case 2B, the approximate 95% confidence interval for μ = the mean GPA of students at the university that enroll in an introductory statistics course is X ± 1.96(s/√n). Inserting the values x = 2.733, s = 0.762, and n = 90 we get

2.733 ± 1.96(.762/√90).
This gives us (2.576, 2.890) as the (approximate) 95% confidence interval for μ. Based on the sample of 90 GPA's, we believe that a reasonable range of guesses for the average GPA in the population is 2.576 to 2.890. We say that we are "95% confident" that the value of μ is between 2.576 and 2.890. The reader should review the proper interpretation of a 95% confidence interval which was discussed in both Case 1B and Case 2A. The "confidence" that we have actually isn't in the two numbers that comprise our confidence interval. Interpreting a confidence interval is never about you or about the results that you personally obtained from your sample. Instead, the confidence is in the mathematical procedure that generated the expression for the interval.

Notice that the value from our previous null hypothesis, 2.889, is just barely inside the calculated confidence interval. This is a reminder that at α = .05, our significance level was just barely lower than the p-value (.05 < .052). Our reasonable range of guesses for μ includes 2.889, but it is near the right boundary of the confidence interval because the p-value of the associated test is near the significance level α = .05.
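
As with the t-based interval, the large-sample interval can be verified in a few lines. The sketch below is optional and uses Python with scipy for the normal cutoff; it reproduces the interval (2.576, 2.890) from the Case 2B summary statistics.

# Optional sketch: the approximate 95% confidence interval for the mean GPA
from scipy import stats

xbar, s, n = 2.733, 0.762, 90

z_cut = stats.norm.ppf(0.975)       # about 1.96 for a 95% interval
margin = z_cut * s / n ** 0.5
print(round(xbar - margin, 3), round(xbar + margin, 3))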

Concepts and Terms To Review from Case Question 2B:

Descriptive statistics
Histogram
Sample mean
Sample median
Population
Sample
Variable of interest
Continuous random variable
Parameter
Statistic
Two-sided alternative
Normally distributed population
Symmetric
Skewed
Unimodal
Bimodal
Sample standard deviation
Central Limit Theorem
Simulation of random variables
Sampling distribution
Balancing point of histogram
Test statistic
Null hypothesis
Alternative hypothesis
p-value
Standard normal distribution
Type I error
Type II error
Significance level
Confidence coefficient
Confidence interval for one mean
95% confident

Case Study #3A: Sleep Deprivation in Young Adults
Introduction

Researchers have established that sleep deprivation has a harmful effect on visual learning. But,
is it possible to “make up” for being sleep deprived by getting a full night of sleep in subsequent
nights? Or, do the effects of being sleep deprived linger for days? Does quality rest in the time
following the deprivation prove to be enough to ward off learning difficulties? Recently, a team
of investigators utilized 21 young adult volunteers to conduct a sleep deprivation study to
address these questions. All of the volunteers were the age of typical college students - between
18 and 25 years old.

All 21 volunteers agreed to take a visual discrimination test on a particular “pre-testing” day.
During the test, they would be shown images that would flash on a computer screen and then
they would have to describe what they had seen. The 21 subjects were randomly broken up into
two groups. Eleven of the people were deprived of sleep on the night before the pre-test. The
other 10 were not. After the day of the pre-test, both groups were allowed as much sleep as they
desired for two straight nights and then on the following day the test was administered again.
This was called the post-test.

Information that the volunteers saw on the pre-test should prove useful for improving their scores
on the post-test. But, would the sleep deprived young adults be able to incorporate what they
learned by taking the pre-test and improve at the same level as those that had adequate sleep?
Or, would the sleep deprived group fail to capitalize on having taken the pre-test? If so, the
group with adequate sleep would make greater gains on the post-test than the sleep deprived group would.

Case Question 3A

For all 21 volunteers, the researchers measured how much better the score on the post-test was
than the pre-test. This score represents how much the subject learned by having seen the images
on the pre-test. Is the average gain made by the sleep deprived group the same as the gain made
by the rested group? Or, do the data that the researchers collected indicate that the gains made
by the rested group exceed those gains made by the sleep deprived volunteers on average?

When the volunteers took the tests, the researchers measured the shortest amount of time (in
milliseconds) between images appearing on a screen for which the person could accurately report
what they had seen. That is, the researchers were measuring a reaction time. An image flashes
on the screen and the volunteer has to describe accurately what is being shown. How long does
it take the volunteer to do this? This is the time being recorded in the experiment. Now, if an
image has been seen on the pre-test, when it appears again on the post-test, one would hope that
it could be recalled quickly. Thus, the second time around (three days later), one would hope to
be able to make a gain in reaction time.

Here are the gains made on the post-test by all 21 subjects for Case 3A:

Sleep Deprived Gains: -14.7 -10.7 -10.7 2.2 2.4 4.5 7.2 9.6 10.0
21.3 21.8

Rested Gains: -7.0 11.6 12.1 12.6 14.5 18.6 25.2 30.5 34.5
45.6

Let’s make sure we understand the data set. Those students with large positive values made the
most gains. The “45.6” in the rested gains portion of the data set indicates that this person
recalled the image and was able to describe it accurately as much as 45.6 milliseconds faster on
the post-test than on the pre-test. As a point of reference, all volunteers were found to have baseline reaction times during the sample composition phase of the research between 30 and 100 milliseconds. So, an improvement of 45.6 milliseconds between the pre-test and post-
test should be viewed as a rather large gain.

All of the positive values in the data set represent improvements that the subjects made in their reaction times on the post-test. However, notice that some subjects actually did worse on the
post-test despite their familiarity with the test. The negative values in the data set represent those
volunteers doing worse on the post-test than on the pre-test. A value of zero in the data set
would represent no change whatsoever between the pre-test and the post-test.
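
Because both samples are small, it is easy to enter the gains directly and look at simple summaries before any formal analysis. The sketch below is optional (any software would do; Python is used here only for consistency with the other optional sketches above) and prints each group's sample mean.

# Optional sketch: entering the Case 3A gains and computing the two sample means
sleep_deprived = [-14.7, -10.7, -10.7, 2.2, 2.4, 4.5, 7.2, 9.6, 10.0, 21.3, 21.8]
rested = [-7.0, 11.6, 12.1, 12.6, 14.5, 18.6, 25.2, 30.5, 34.5, 45.6]

# Sample means of the gains: roughly 3.9 ms for the sleep deprived group
# and roughly 19.8 ms for the rested group
print(sum(sleep_deprived) / len(sleep_deprived))
print(sum(rested) / len(rested))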

Population and Sample

A closer look at the research team that organized this study reveals that the testing took place in
Massachusetts. In addition to the study that forms the basis for Case 3A, the researchers studied
other sleep related issues in adults between the ages of 18 and 25. Therefore, in general, we will
take the population under study to be the general group of 18 to 25 year old adults. The reaction
time pre-test and post-test were conducted on a very small sample of this population – the 21
volunteer individuals. Thus, our sample consists of these 21 volunteers.

It would be fair to contemplate whether adults who volunteer for studies such as this could be
viewed differently than those who do not volunteer. That is, as always, we need to consider the
issue of whether or not our sample is representative of the population. Strictly speaking, the
researchers were not in a position to list all the members of the population and invite participants
into the study at random. It is unknown whether or not the volunteers were paid to be a part of
the study. If so, then this may be one way in which those in the study are different from the
population as whole. Nonetheless, is this difference – if it even exists – enough to create obvious
evidence that the 21 subjects are not reasonably representative of the population? Probably not.
This is another instance where investigators did not specifically choose the sample at random yet
still assume that the 21 volunteers behave as if they were chosen in such a way.

Case 3A is different from previous cases examined in that the general group of 18 to 25 year old
adults is being thought of as consisting of two populations. These two populations are being
compared. The researchers are interested in statistical inference about sleep deprived young
adults and rested young adults. In this way, our statistical problem concerns inference about two
populations. We may consider the 11 people in the sleep deprived sample as a random sample

from a population of sleep deprived adults between the ages of 18 and 25. Likewise, we can
consider the 10 people in the rested sample as a random sample from a population of rested 18 to
25 year old adults. Ultimately, we wish to compare the visual learning from one population to
that of the other. This is a two sample statistical inference problem. One sample from each of
two populations.

Random Variables, Parameters and Statistics of Interest

The key question in Case 3A is whether the gains made in reaction times by the rested group are
larger than those gains made by the sleep deprived volunteers on average. Above, we have
defined the two populations of interest. To be a bit more formal, let Population 1 be the
population of sleep deprived 18 to 25 year old adults. Likewise, let Population 2 be the
population of rested 18 to 25 year old adults. The 11 observations in the sleep deprived sample
are considered to be a representative sample from Population 1. The gains made by each of these
11 individuals are outcomes from a continuous random variable.

Let $X_{1i}$ be the gain in reaction time made by the $i$th ($i = 1, \ldots, 11$) member of the sleep deprived
sample. Similarly, let $X_{2j}$ be the gain in reaction time made by the $j$th ($j = 1, \ldots, 10$) member
of the rested group. The first subscript on the “X” represents the population that was sampled
and the second subscript keeps track of the particular individual within the sample. Returning to
the key questions concerning the averages of Populations 1 and 2, we can define the following
parameters of interest for Case 3A.

Let $\mu_1$ be the average gain (post-test value – pre-test value) in reaction time among 18 to 25 year
old adults that were sleep deprived.

Let $\mu_2$ be the average gain (post-test value – pre-test value) in reaction time among 18 to 25
year old adults that were properly rested.

Previously, when the parameters of interest were population means, we used the sample means
as estimates. While the values of $\mu_1$ and $\mu_2$ are unknown, we can calculate the average reaction
time gains made among the individuals in our two samples. These sample means are two of the
statistics of interest in Case 3A. In order to estimate $\mu_1$, we will use $\bar{X}_1$ and define this statistic
to be the sample mean of the 11 reaction time gains made by the individuals in the sleep deprived
sample. Similarly, let $\bar{X}_2$ be the sample mean of the 10 reaction time gains made by those in the
rested sample.

Hypotheses To Be Tested

If there is truly any difference between the population means $\mu_1$ and $\mu_2$, then it should result in
the values of $\bar{X}_1$ and $\bar{X}_2$ being different in such a way that could not be attributed to chance

alone. This is what our statistical procedure must investigate. We need to know whether or not
any observed difference between $\bar{X}_1$ and $\bar{X}_2$ can be chalked up to just random chance. If not,
then the data will have provided statistically significant evidence for the claim that the gains made by
the rested group are larger than those gains made by the sleep deprived group on average.

The key question in Case 3A can now be translated in terms of $\mu_1$ and $\mu_2$. If the average gain
made by the sleep deprived group is the same as the average gain made by the rested group, then
we would state that $\mu_1 = \mu_2$. The researchers are interested in whether or not the evidence in the
data is suggestive of an alternative theory – namely, that the gains made by the rested group are
larger than the gains made by the sleep deprived volunteers on average. This alternative theory
should not be asserted until sufficient evidence is presented to support it. We should not
believe there is a difference in the means of the two populations until data driven statistical
evidence suggests that this claim is reasonable. Thus, our null and alternative hypotheses for
Case 3A are:

$H_0: \mu_1 = \mu_2$ (the average gain made by the sleep deprived group is the same as the average gain
made by the rested group)

$H_A: \mu_1 < \mu_2$ (the gains made by the rested group are larger than the gains made by the sleep
deprived volunteers on average)

Test Statistic and Obstacles to Finding an Exact Null Distribution

In order to establish whether or not the data substantiate the claim that $\mu_1 < \mu_2$, one strategy is
to rewrite the null and alternative hypotheses above. The null hypothesis can be represented as
$H_0: \mu_1 - \mu_2 = 0$. Thinking in this same way, if $\mu_1 < \mu_2$, then the alternative hypothesis might as
well be written as $H_A: \mu_1 - \mu_2 < 0$. Next, since $\bar{X}_1$ is a statistic which estimates $\mu_1$ and $\bar{X}_2$ is a
statistic estimating $\mu_2$, it makes sense to use the statistic $\bar{X}_1 - \bar{X}_2$ when trying to estimate the
difference in population means, $\mu_1 - \mu_2$. If the alternative hypothesis is most reasonable, then
we would need to observe $\bar{X}_1 - \bar{X}_2$ to be "significantly negative". Of course, the issue is to try
and quantify what we mean by this term.

At this point, it makes sense to try and use $\bar{X}_1 - \bar{X}_2$, or some expression that makes use of this
quantity, as a test statistic. Now, we should not reinvent the wheel just because we have two
samples. In Cases 2A and 2B, we wanted to investigate a single population mean, $\mu$. We used
the sample mean, $\bar{X}$, as an estimate and ultimately, our test statistics wound up being
standardized or studentized versions of $\bar{X}$. Recall, the t-statistic for the one-sample t-test
for a single population mean is

$$ t = \frac{\bar{X} - \mu}{s/\sqrt{n}}. $$

It would make sense to try and standardize or studentize $\bar{X}_1 - \bar{X}_2$ in order to create a test statistic
that could facilitate resolution of Case 3A.

Recall a key theorem that was used in the solution of Case 2A:

Theorem: Suppose a random sample of size $n$ is collected on a variable. If the population of
possible outcomes for this variable has a normal distribution with mean $\mu$ and standard
deviation $\sigma$, then the sample mean, $\bar{X}$, also has a normal distribution with mean $\mu$, but with
standard deviation $\sigma/\sqrt{n}$.

This result is what allowed us to standardize $\bar{X}$. From this result, we were able to state that if
the population of possible outcomes for a variable has a normal distribution, then

$$ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} $$

has a standard normal distribution. Case 3A differs from Case 2A in that there are now two
populations and therefore, we have collected two samples, and have two sample means that we
are trying to incorporate into a test statistic. Further, as is the scenario in Case 3A, the two
samples that have been collected do not need to be of the same size. Indeed, we have 11
observations in the sleep deprived sample and 10 individuals in the rested sample. In general, let
n1 be the size of the sample collected from Population 1 and let n2 be the size of the sample
gathered from Population 2.

If we look closely at the result of the above theorem we can see how the concluding phrase “...
$\bar{X}$ also has a normal distribution with mean $\mu$, but with standard deviation $\sigma/\sqrt{n}$” leads to the
construction of the Z-statistic. In terms of words, we see that

$$ Z = \frac{\bar{X} - \text{Mean of } \bar{X}}{\text{Standard Deviation of } \bar{X}}. $$

If a similar theorem could be borrowed from mathematics regarding the distribution of $\bar{X}_1 - \bar{X}_2$,
then we could standardize $\bar{X}_1 - \bar{X}_2$ and create another Z-statistic. This new Z-statistic would then
be relevant to a two-sample investigation rather than our one sample inquiries. Such a theorem
from mathematical statistics does exist and it is similar in structure to the one sample theorem
above. While this theorem will provide us with the ability to standardize, it will not provide an
avenue to resolution of Case 3A. More development will be needed. For now, let’s look at the
theorem for two samples which parallels the theorem above for one sample.

Theorem: Suppose a random sample of size $n_1$ is collected on a variable and a second random
sample of size $n_2$ is collected on a second variable which is independent of the first variable.
Suppose the population of possible outcomes for the first variable has a normal distribution with
mean $\mu_1$ and standard deviation $\sigma_1$. Likewise, suppose the population of possible outcomes for
the second variable has a normal distribution with mean $\mu_2$ and standard deviation $\sigma_2$. Then,
the difference in the sample means, $\bar{X}_1 - \bar{X}_2$, also has a normal distribution with mean $\mu_1 - \mu_2$
and standard deviation $\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}$.

This theorem shares a fair amount of similarity with our previous one sample results. However,
the theorem is a bit longer since it has to account for all of the information on two variables and
two samples. There are several things to notice. First, the theorem requires that the populations
we are sampling have normal distributions. We have encountered this assumption before in Case
2A. Make sure you are aware that the results of the above theorem require normally distributed
populations. Second, notice that the standard deviation term in the result looks a little different
than what we saw when dealing with one sample. The values $\sigma_1^2$ and $\sigma_2^2$ under the radical are
known as the population variances. Lastly, do NOT be tempted to “distribute” the square root to
each term. From algebra, recall that in general, $\sqrt{a+b} \neq \sqrt{a} + \sqrt{b}$. Thus, we should NOT write
the standard deviation of $\bar{X}_1 - \bar{X}_2$ as $\sqrt{\sigma_1^2/n_1} + \sqrt{\sigma_2^2/n_2}$.

Now, similar to before, we can concentrate on the concluding phrase of the theorem in order to
construct a standardization formula. The final phrase “… $\bar{X}_1 - \bar{X}_2$ also has a normal distribution
with mean $\mu_1 - \mu_2$ and standard deviation $\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}$” leads to

$$ Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}. $$

However, as with Case 2A, the standard deviations of the variables in Case 3A are not known.
We do not know the standard deviation of the reaction time gains among all 18 to 25 year old
people who are sleep deprived. Likewise, we do not know the standard deviation of the gains
made by the population of rested individuals. We have collected samples of size 11 and 10,
respectively. From these samples, we can calculate the sample standard deviations. But the
population standard deviations, $\sigma_1$ and $\sigma_2$, are unknown. We can calculate $\bar{X}_1 - \bar{X}_2$ from the
information in our samples and the null hypothesis claims that $\mu_1 - \mu_2 = 0$. Thus, these two
quantities can be used to form the numerator of our test statistic. However, since the population
standard deviations are unknown, we cannot calculate the denominator of the above Z-statistic.

We have seen many times that it is typical to estimate parameters with statistics. This can surely
be done now. Despite not knowing $\sigma_1$ and $\sigma_2$ exactly, we can estimate them with $s_1$ and $s_2$,
respectively. We have sample data and thus the sample standard deviations can be obtained.

Therefore, it seems logical to replace $\sigma_1$ and $\sigma_2$ with $s_1$ and $s_2$ and attempt to carry on with a
resolution of Case 3A. A statistic that we can calculate would be

$$ \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}. $$

Later, we will see that for large sample sizes, the above statistic is exactly what proves useful.
When dealing with a single population and sample, we replaced $\sigma$ with $s$, required the
population to be normally distributed, and then created the t-statistic, $t = (\bar{X} - \mu)/(s/\sqrt{n})$.
Given we have again replaced the unknown population standard deviations by the sample
standard deviations, one would hope that the statistic

$$ \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} $$

also had a t distribution. Unfortunately, this is not the case. Obtaining the exact distribution of
the statistic above is a particularly difficult mathematical problem. This is especially true when
the sample sizes are small. Researchers have shown that this statistic can be taken to be
approximately t distributed in some circumstances. But, the accuracy of this approximation is
sometimes good and sometimes poor.

There is one situation in which we can arrive at a t-statistic when analyzing two population
means. If the two population standard deviations, $\sigma_1$ and $\sigma_2$, can be taken to be equal, then a t-
statistic will emerge. Of course, since we don’t know $\sigma_1$ and $\sigma_2$, assuming that they are equal
could be problematic. Then again, at other times, this assumption can be justified. For instance,
consider a group of runners competing in a race that is one mile long. Among these runners
there will be some spread, or standard deviation, in the time it takes to complete the mile. If the
race were run into a headwind, we would expect the runners' times to be slower, but we might
not necessarily see much of a difference in the spread between the finishing times. In this case, it
is as if the wind "shifted" the mean of the variable without affecting the spread. These cases
where there is a "shift" or "shock" to the variable are reasonable scenarios in which standard
deviations could be assumed equal.

The scenario in Case 3A might very well fit the "shift" or "shock" motif. The act of being sleep
deprived may essentially "shift" the reaction time gains in a way that changes their collective average,
but otherwise, the spread and shape of the resulting data could be assumed to be the same. Many
experiments that fit this treatment vs. control structure adhere to the shift model. With this in
mind, return to the standardized form of $\bar{X}_1 - \bar{X}_2$ that generated the Z-statistic. If we now
assume that the two population standard deviations are equal, $\sigma_1 = \sigma_2 = \sigma$ say, then our Z-
statistic becomes

$$ Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma^2/n_1 + \sigma^2/n_2}} = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sigma\sqrt{1/n_1 + 1/n_2}}. $$

The algebra under the radical begins by factoring a value of $\sigma^2$ from each term. Then, since
$\sqrt{ab} = \sqrt{a}\sqrt{b}$, we can take the square root of $\sigma^2$, which is just $\sigma$ since this value is positive.
The development is shown below:

$$ \sqrt{\frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_2}} = \sqrt{\sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}. $$

So, at this point we have a Z-statistic associated with the hypothesis test from Case 3A given by

$$ Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sigma\sqrt{1/n_1 + 1/n_2}}. $$

This test statistic still cannot be used to complete Case 3A. Despite having assumed that
$\sigma_1 = \sigma_2 = \sigma$, we don’t know what this common value of $\sigma$ is. So, again, as we have
done many times before with parameters that cause us this nuisance, we seek to “replace” $\sigma$.
But, with what? One advantage of having $\sigma_1$ and $\sigma_2$ in the problem was that they were
naturally estimated by $s_1$ and $s_2$, respectively. Having assumed $\sigma_1 = \sigma_2 = \sigma$, do we replace
$\sigma$ with $s_1$ or $s_2$?

The Need for Pooling

The sample from Population 1 generates $s_1$ and it is an estimate of $\sigma$. Likewise, the sample
from Population 2 generates $s_2$ and it is also an estimate of $\sigma$. Since we have assumed that
both samples are random samples from their respective populations, there is no reason to use one of
these statistics and completely ignore the other. Instead, it makes sense to pool the two
estimates together in order to make an overall more accurate estimate of $\sigma$. That is, it is
reasonable to combine $s_1$ and $s_2$ in some way to generate an estimate of $\sigma$. How
should this combination be constructed? It makes sense to weight the sample standard deviation
based on the larger sample size more than the sample standard deviation from the smaller sample.
If the two sample sizes were exactly the same, then we could just calculate the regular
average of $s_1$ and $s_2$ in order to get our estimate of $\sigma$. But, if the sample sizes are different, we
need to create a weighted average. This weighted average is called the pooled standard
deviation and it is denoted by $s_p$. The formula for $s_p$ is

$$ s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}. $$

Notice that the value underneath the radical is a linear combination of $s_1^2$ and $s_2^2$. The weights
on the values of $s_1^2$ and $s_2^2$ are entirely in terms of the sample sizes. Thus, the sample comprised
of more data will receive the greater weight. To see this, consider rewriting the linear
combination under the radical. The value under the radical is known as the pooled variance and
is denoted $s_p^2$. It can be written as

$$ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{n_1 - 1}{n_1 + n_2 - 2}\, s_1^2 + \frac{n_2 - 1}{n_1 + n_2 - 2}\, s_2^2. $$

Specifically, in Case 3A, we have a sample of size 11 from Population 1 and a sample of size 10
from Population 2. So, the first sample is slightly larger and thus it makes sense to weight $s_1$ a
little more heavily than $s_2$ in the calculation of the pooled standard deviation. Plugging in $n_1 = 11$
and $n_2 = 10$ gives us

$$ s_p = \sqrt{\frac{10}{19}\, s_1^2 + \frac{9}{19}\, s_2^2}. $$

Sure enough, the value of $s_1$ is weighted slightly more heavily than the value of $s_2$ ($\tfrac{10}{19} > \tfrac{9}{19}$). Now
that we have created the pooled standard deviation, $s_p$, we can use it in place of $\sigma$ and the result
is a test statistic that we can actually numerically calculate:

$$ \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{1/n_1 + 1/n_2}}. $$

The distribution of this statistic under the assumption of sampling from normally distributed
populations is given in the following theorem.

Theorem: Suppose a random sample of size $n_1$ is collected on a variable and a second random
sample of size $n_2$ is collected on a second variable which is independent of the first variable.
Suppose the population of possible outcomes for the first variable has a normal distribution with
mean $\mu_1$ and standard deviation $\sigma$. Likewise, suppose the population of possible outcomes for
the second variable has a normal distribution with mean $\mu_2$ and standard deviation $\sigma$. If the
unknown common standard deviation $\sigma$ is replaced by $s_p$, then the quantity

$$ t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{1/n_1 + 1/n_2}} $$

has a t distribution with $n_1 + n_2 - 2$ degrees of freedom.
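
As a quick illustration of the studentization formula in this theorem, the statistic can be computed directly from the two sample means, the pooled standard deviation and the sample sizes. The Python sketch below is only an illustration (the function name is our own; Python is not part of the course software):

    from math import sqrt

    def pooled_t(xbar1, xbar2, s_p, n1, n2, null_diff=0.0):
        """Studentized difference in sample means when a common sigma is assumed.
        null_diff is the value of mu1 - mu2 claimed by the null hypothesis."""
        return (xbar1 - xbar2 - null_diff) / (s_p * sqrt(1 / n1 + 1 / n2))

Under the null hypothesis, the value returned by this function is compared against a t distribution with $n_1 + n_2 - 2$ degrees of freedom.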

Null Distribution for the Two-Sample Test on Means Assuming a Common Standard Deviation

The theorem above provides us with a test statistic and null distribution for the resolution of the
hypothesis test in Case Question 3A. When dividing by a quantity that involves a standard
deviation based on samples ($s_p$) rather than the population standard deviation ($\sigma$), we use the
phrase studentizing, rather than standardizing. Thus, the quantity in the theorem is the
studentization formula for the difference in two sample means.

The theorem above is a major step towards calculating a p-value in order to resolve Case 3A.
However, realize that in order to use the theorem we must assume that the data in our samples
came from normally distributed populations. Our theorems only address the situation where the
distribution of the populations are normal. While investigating Case 2A, we constructed
histograms for a sample comprised of 18 data points. Here, our sample sizes are even smaller. It
is very difficult to infer a reasonable shape for the distribution of a population from samples as
small as 10 or 11. Creating a histogram for a sample this small will not be overly informative.
However, as with Case 2A, we can use descriptive statistics mainly to check for extreme skew or
outliers. With this in mind, consider the following two histograms produced by JMP for the
sleep deprived and rested reaction time samples.

[Histograms of the reaction time gains: Sleep Deprived (left) and Rested (right)]

On the basis of the above histograms, there is no strong evidence to contradict the reasonableness
of making an assumption of normality. With the size of the samples in mind, it is entirely
reasonable to believe that these samples came from larger populations that are normally
distributed. Since there is no obvious contradiction to normality, the studentized form of the
difference in the two sample means is a good choice for a test statistic in Case 3A.

Recall our null and alternative hypotheses in Case 3A:

$H_0: \mu_1 = \mu_2$ (the average gain made by the sleep deprived group is the same as the average gain
made by the rested group)

$H_A: \mu_1 < \mu_2$ (the gains made by the rested group are larger than the gains made by the sleep
deprived volunteers on average)

To facilitate analysis, we will assume that the sample of size $n_1 = 11$ reaction time values for the
sleep deprived group represents a random sample from a normally distributed population (at least
approximately symmetric without evidence for skew and multiple modes). Likewise, we will
assume that the sample of size $n_2 = 10$ from the rested population of people is a random sample
from another normally distributed population. Since our parameter of interest is the difference in
population means $\mu_1 - \mu_2$, we will estimate it with the statistic $\bar{X}_1 - \bar{X}_2$. Once we studentize
$\bar{X}_1 - \bar{X}_2$, we have a test statistic which is

$$ t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{1/n_1 + 1/n_2}} $$

and the distribution of this statistic when the null hypothesis is true is t with $n_1 + n_2 - 2$ degrees
of freedom.

Case Question 3A Concluding Decision

We are now in a position to calculate the value of the t-test statistic and use tables or Excel to
obtain the p-value of the test. From the p-value we will be able to make a determination as to
whether we have sufficient statistical evidence to reject $H_0: \mu_1 = \mu_2$. Before we do this, recall
the following terms related to hypothesis tests:

Type I error: Making the decision to reject the null hypothesis when, in fact, the null
hypothesis is true.

Type II error: Making the decision to not reject the null hypothesis when, in fact, the
alternative hypothesis is true.

Significance Level ($\alpha$) - the probability of committing a Type I error.

In Case 3A, making a Type I error is claiming that the rested group made larger gains than the
sleep deprived group when in reality the gains between the two groups were equivalent on
average. Making a Type II error is claiming there is no difference between the two groups, when
in fact, the rested group did make larger gains.

When the experiment was being planned and the hypotheses being formulated, these two errors
should have been thought about by the experimental team. It would be proper for the researchers

investigating sleep deprivation and its effects on reaction times to discuss the implications of
these errors prior to gathering data. The consequences (financial, environmental, physical,
scientific, etc.) of making a Type I error should be weighed against the repercussion of making a
Type II error. After these two errors have been discussed, the experimenters could then (in
advance of experimentation and data collection) choose an $\alpha$ level. The more the researchers
fear making a Type I error, the lower the significance level should be set. Once the $\alpha$ level is
set, it can be compared against the p-value. If the p-value is lower than $\alpha$, then $H_0$ is rejected
and we will conclude that the gains made by the rested group are larger than the gains made by
the sleep deprived volunteers. However, if the p-value is greater than or equal to $\alpha$, $H_0$ will be
retained.

Remember, as a rough guide, investigators that feel the consequences of making a Type I error
are particularly severe typically choose $\alpha = 1\%$ ($\alpha = .01$). On the other hand, when Type II
errors are considered most dire, experimenters often choose $\alpha = 10\%$ ($\alpha = .10$). In the cases
that both errors are seen as equally serious, or there is little way to know the effect of the errors,
then a general compromise is struck and the significance level is chosen to be near 5%.

We have several summary statistics to calculate. The sample means, $\bar{X}_1$ and $\bar{X}_2$, are simply the
averages of the two samples. The total of the data in the sleep deprived sample is 42.9 and the
total of the data from the rested group is 198.2. Thus, we have $\bar{x}_1 = 3.9$ and $\bar{x}_2 = 19.82$. Next,
the sample standard deviation of the data $x_1, \ldots, x_n$ is calculated using the formula

$$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}. $$
The computational short-cut formula for the sample standard deviation is

$$ s = \sqrt{\frac{\sum x_i^2 - \left(\sum x_i\right)^2 / n}{n - 1}}. $$
The standard deviation of each sample must be calculated and then these two statistics will be
pooled. We need to apply the sample standard deviation formula to $x_{1i}$ for $i = 1, \ldots, 11$ in order to
obtain $s_1$ and then apply the formula to $x_{2j}$ for $j = 1, \ldots, 10$ in order to obtain $s_2$. These
calculations can be made by hand, in Excel or with JMP. For the sleep deprived sample ($x_{1i}$,
$i = 1, \ldots, 11$) we obtain (you check)

$$ s_1 = \sqrt{\frac{1648.85 - (42.9)^2/11}{10}} = \sqrt{148.154} = 12.17. $$

A similar calculation shows that $s_2 = 14.73$. Next, we can calculate the pooled estimate of the
population standard deviation. Recall that we have assumed the two populations being sampled

have the same standard deviation. This assumption appears reasonable in that the sample standard
deviations are reasonably close to each other. The pooled standard deviation is

$$ s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{10(12.17)^2 + 9(14.73)^2}{19}} = \sqrt{180.7287} = 13.44. $$

Putting this all together, the value of the test statistic is

$$ t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{1/n_1 + 1/n_2}} = \frac{(3.9 - 19.82) - 0}{13.44\sqrt{1/11 + 1/10}} = -2.711. $$

Realize that when calculating the value of the test statistic, we insert 0 in for $\mu_1 - \mu_2$. This is due
to the null hypothesis being relevant when we are calculating a p-value from the null distribution.
The null hypothesis is that $\mu_1 - \mu_2 = 0$. To determine if $t = -2.711$ is a statistically significant
result which indicates the null hypothesis should be rejected, we will need to calculate a p-value
for the test.

p-value: The chance of observing the value of the statistic from your sample (or one more
extreme) if, in fact, the null hypothesis is true.

Recall, the meaning of “more extreme” is governed by the direction of the alternative hypothesis.
Since the alternative hypothesis is $H_A: \mu_1 < \mu_2$, which is to say $\mu_1 - \mu_2 < 0$, extreme evidence
for the alternative consists of very low values of the test statistic (notice the “<” in $H_A$). The
distribution of the test statistic when $H_0$ is true is t with 19 degrees of freedom (denoted $t_{19}$).
Thus, the p-value used in the resolution of Case 3A is

$$ \text{p-value} = P(t_{19} < -2.711). $$

If using a t-table, we would need to remember that all t distributions are symmetric. Thus,
$P(t_{19} < -2.711) = P(t_{19} > 2.711)$. A t-table tells us that this p-value is between .005 and .01.
Excel or statistical software can obtain a more accurate answer. We can use the Excel command
=T.DIST(-2.711, 19, TRUE) in any spreadsheet cell to conclude that the p-value is .0069. This
p-value is lower than all typical $\alpha$ levels, so even if the experimenters had chosen $\alpha = .01$, we
have sufficient evidence to reject the null hypothesis. Therefore, on the basis of the assumptions
made in Case 3A and the data collected by the researchers, we claim that the gains made by the
rested group are larger than the gains made by the sleep deprived volunteers on average. The
rest really did make a difference in terms of visual learning. The test that has been developed in
conjunction with Case Question 3A is called the two-sample t-test (assuming equal population
standard deviations). The two-sample t-test is an inferential procedure for comparing the
means of two populations.
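
For completeness, the test statistic and p-value can be reproduced from the summary statistics in Python; the sketch below assumes the scipy library is available (it is not required for the course):

    from math import sqrt
    from scipy import stats

    xbar1, xbar2 = 3.9, 19.82
    s_p, n1, n2 = 13.44, 11, 10

    t_stat = (xbar1 - xbar2) / (s_p * sqrt(1 / n1 + 1 / n2))   # roughly -2.711
    p_value = stats.t.cdf(t_stat, df=n1 + n2 - 2)              # roughly 0.0069

Running scipy.stats.ttest_ind on the two raw samples with equal_var=True and alternative='less' gives essentially the same t statistic and p-value.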

Confidence Interval For the Difference in Two Population Means Based on the Two-
Sample t-test

As we have done in previous case studies, we can construct a confidence interval in order to
estimate the parameter $\mu_1 - \mu_2$. Some experimental situations involving data from two samples
may not lead to consideration of a hypothesis test. Often, the objective is simply to estimate the
difference between the two means. If we need a single estimate of $\mu_1 - \mu_2$, then we know that
this can be accomplished with the difference in the sample means, $\bar{X}_1 - \bar{X}_2$. In order to
construct a range of estimates – that is, formulate a confidence interval – we will need to know the
distribution of $\bar{X}_1 - \bar{X}_2$. Provided we are sampling from two independent normally distributed
populations, we know that the statistic

$$ \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{1/n_1 + 1/n_2}} $$

has a t distribution with $n_1 + n_2 - 2$ degrees of freedom. Previously, when concluding Case 2A,
we used the notation $t_{\alpha/2,\,df}$ to denote the value creating the boundary for the uppermost $\alpha/2$ area
under the t density curve with $df$ degrees of freedom. See the graph below where the area in the
shaded tail is $\alpha/2$.

[t density curve with upper-tail area $\alpha/2$ to the right of $t_{\alpha/2,\,df}$]

If we let T denote a t random variable, then this picture can also facilitate the probability
statement:

$$ P(-t_{\alpha/2,\,df} < T < t_{\alpha/2,\,df}) = 1 - \alpha. $$

As an example, in Case 3A we have that $df = 19$ and so if $\alpha = .05$ we can use t-tables or
software to conclude

$$ P(-2.093 < t_{19} < 2.093) = .95. $$

Next, provided we are sampling from two independent normally distributed populations, the
statistic

$$ \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{1/n_1 + 1/n_2}} $$

has a t distribution and therefore we can substitute this expression in for T. So,

$$ P\left( -t_{\alpha/2,\,df} < \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{1/n_1 + 1/n_2}} < t_{\alpha/2,\,df} \right) = 1 - \alpha. $$
As we have done in previous confidence interval development, we can now just algebraically
rearrange the inequality until the parameter $\mu_1 - \mu_2$ has been isolated in the center. When this is
completed, we have the 95% confidence interval for the difference in two population means.
This interval is appropriate to use as an estimation tool when the populations being sampled can
reasonably be assumed to be normal (or at least approximately unimodal and symmetric, devoid
of extreme skew and/or outliers). Our formula for the 95% confidence interval for $\mu_1 - \mu_2$ under
these assumptions is

$$ (\bar{X}_1 - \bar{X}_2) \pm t_{\alpha/2,\, n_1+n_2-2}\; s_p\sqrt{1/n_1 + 1/n_2}. $$
Suppose a random sample of size $n_1$ is collected on a variable and a second random sample of
size $n_2$ is collected on a second variable which is independent of the first variable. Suppose the
population of possible outcomes for the first variable has a normal distribution with mean $\mu_1$ and
standard deviation $\sigma$. Likewise, suppose the population of possible outcomes for the second
variable has a normal distribution with mean $\mu_2$ and standard deviation $\sigma$. If the unknown
common standard deviation $\sigma$ is replaced by $s_p$, then the quantity

$$ (\bar{X}_1 - \bar{X}_2) \pm t_{\alpha/2,\, n_1+n_2-2}\; s_p\sqrt{1/n_1 + 1/n_2} $$

is a $100(1-\alpha)\%$ confidence interval for the difference in the population means, $\mu_1 - \mu_2$.

For the sleep deprivation data of Case 3A, $\bar{x}_1 = 3.9$ and $\bar{x}_2 = 19.82$. Also, the pooled standard
deviation is $s_p = 13.44$. The sample sizes are $n_1 = 11$ and $n_2 = 10$. We know that for 19 degrees of
freedom, the correct t-value to use in a 95% confidence interval is 2.093. Thus, the 95%
confidence interval for the difference in reaction time mean gains is

$$ -15.92 \pm (2.093)(13.44)\sqrt{1/11 + 1/10} $$

which gives $(-28.21, -3.63)$. So, a reasonable range of guesses for $\mu_1 - \mu_2$ is between -28.21
milliseconds and -3.63 milliseconds. Notice that the confidence interval lies entirely below zero,
which seems reasonable given that we rejected $H_0: \mu_1 = \mu_2$ in favor of $H_A: \mu_1 < \mu_2$. The
confidence interval has provided us with an estimate of just how much larger the reaction time
gains are in the rested group. It appears that the average gain made by the rested group can
reasonably be assumed to exceed that of the sleep deprived group by anywhere from 3.63
milliseconds to as much as 28.21 milliseconds.
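
The interval arithmetic above can also be reproduced with a short Python sketch (scipy is assumed here only for the t critical value):

    from math import sqrt
    from scipy import stats

    xbar1, xbar2 = 3.9, 19.82
    s_p, n1, n2 = 13.44, 11, 10

    t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)      # roughly 2.093
    margin = t_crit * s_p * sqrt(1 / n1 + 1 / n2)    # roughly 12.29
    lower = (xbar1 - xbar2) - margin                 # roughly -28.21
    upper = (xbar1 - xbar2) + margin                 # roughly -3.63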

We say that we are “95% confident” that the value of $\mu_1 - \mu_2$ is between -28.21 and -3.63. The
reader should review the proper interpretation of a 95% confidence interval which was discussed
in previous cases. The “confidence” that we have actually isn’t in the two numbers that
comprise our confidence interval. Interpreting a confidence interval is never about you or about
the results that you personally obtained from your sample. Instead, the confidence is in the
mathematical procedure that generated the expression for the interval.

Statistical Follow Up to Case 3A: Large Sample Test For The Difference in Population
Means

In Case 2B, we saw that the Central Limit Theorem guarantees us that the sample mean is
approximately normally distributed as the sample size gets large. This statement is true for all
populations we might sample in nature. The use of a t-statistic theoretically presumes that the
populations being sampled are normal. This is not a requirement for use of the Central Limit
Theorem. But, making use of a large sample is a requirement for use of the CLT.

Therefore, if we have a large sample from each of two populations and we desire a hypothesis
test for $H_0: \mu_1 = \mu_2$, then we can start with the following realizations:

•  There is no need to assume the two samples come from normal populations.
   Independence will suffice without normality. The Central Limit Theorem assures us that
   $\bar{X}_1$ is approximately normal and that $\bar{X}_2$ is approximately normal. Mathematical
   statistics can show that these two facts can be combined to conclude that $\bar{X}_1 - \bar{X}_2$ is
   approximately normal.
•  There is no need to use the pooled standard deviation. Each sample is large enough to
   provide a reliable estimate of its own population standard deviation. Thus, $s_1$ and $s_2$ can be used in
   test statistic and confidence interval formulas without the need for $s_p$.

The two bullets above allow for the use of the statistic

$$ Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} $$

when conducting large sample hypothesis tests for the difference in two population means. This
test statistic is approximately normal and the p-values resulting from its use are also approximate
(but generally considered very accurate). In similar fashion, if an experimenter would like to
estimate $\mu_1 - \mu_2$ on the basis of data collected from large independent samples, the confidence
interval formula

$$ (\bar{X}_1 - \bar{X}_2) \pm z_{\alpha/2}\sqrt{s_1^2/n_1 + s_2^2/n_2} $$

can be used. The result is an approximate $100(1-\alpha)\%$ confidence interval for the difference in
the population means, $\mu_1 - \mu_2$. The value $z_{\alpha/2}$ is the value that places $\alpha/2$ area in the right hand
tail of the standard normal distribution. For 95% confidence intervals, $z_{\alpha/2} = 1.96$. For 99%
confidence intervals use $z_{\alpha/2} = 2.575$ and for 90% confidence intervals use $z_{\alpha/2} = 1.645$.
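
A sketch of this large sample interval as a reusable Python function (the function name is our own; scipy is assumed only for the standard normal quantile) might look like this:

    from math import sqrt
    from scipy import stats

    def large_sample_ci(xbar1, xbar2, s1, s2, n1, n2, conf=0.95):
        """Approximate CI for mu1 - mu2 justified by the Central Limit Theorem.
        Intended for large samples; no normality or equal-sigma assumption is used."""
        z = stats.norm.ppf(1 - (1 - conf) / 2)       # e.g., 1.96 when conf = 0.95
        margin = z * sqrt(s1**2 / n1 + s2**2 / n2)
        diff = xbar1 - xbar2
        return diff - margin, diff + margin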

Statistical Follow Up to Case 3A: The Problem of Unequal Population Standard Deviations

Statisticians have attempted to come up with procedures for testing $H_0: \mu_1 = \mu_2$ when the data
come from normal populations with different standard deviations. This is called the Behrens-
Fisher problem and unfortunately, an exact distribution of an appropriate test statistic has not
been discovered. In the case that the population standard deviations $\sigma_1$ and $\sigma_2$ are unknown and
unequal, an approximate procedure makes use of our statistic

$$ \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}. $$

This statistic is neither t distributed nor Z distributed when $\sigma_1 \neq \sigma_2$. Again, no method to
generate exact p-values is known. This is particularly problematic when the sample sizes are
small since in this case the Central Limit Theorem does not apply. The Welch approximate t-test
is a procedure developed in 1938 by its namesake which treats the above test statistic as
approximately t-distributed. However, a rather gross modification to the degrees of freedom
must be made. The Welch procedure calculates $r_s = s_1^2/s_2^2$ and then advises using degrees of
freedom equal to

$$ \frac{\left( \dfrac{r_s}{n_1} + \dfrac{1}{n_2} \right)^2}{\dfrac{1}{n_1 - 1}\left( \dfrac{r_s}{n_1} \right)^2 + \dfrac{1}{n_2 - 1}\left( \dfrac{1}{n_2} \right)^2}. $$

Generally, use of this modified degree of freedom formula does not produce an integer and so
the conservative approach is to always round down (that is, ignore the decimals so that 13.843
becomes 13, etc.). Approximate p-values can then be found using t-tables or the Excel t-
distribution commands.
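
As an illustration, the Welch degrees of freedom formula translates directly into code. In the sketch below (the function name is our own), plugging in the Case 3A summary statistics gives roughly 17.6, which the conservative approach would round down to 17:

    def welch_df(s1, s2, n1, n2):
        """Welch approximate degrees of freedom, written in terms of r_s = s1^2 / s2^2."""
        r_s = s1**2 / s2**2
        numerator = (r_s / n1 + 1 / n2) ** 2
        denominator = (r_s / n1) ** 2 / (n1 - 1) + (1 / n2) ** 2 / (n2 - 1)
        return numerator / denominator

    print(welch_df(12.17, 14.73, 11, 10))   # roughly 17.6; round down to 17

If raw samples are available, scipy.stats.ttest_ind with equal_var=False carries out the Welch procedure automatically.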

Statisticians will differ on the issue of when it is proper to use the approach assuming equal
standard deviations versus using the Welch approximate approach. The former is far more
popular a technique while the latter does not require an assumption of equal standard deviations.
The comparison between these two procedures could be explored in more advanced statistics
courses. Loosely, for moderate sample sizes, if one of the sample standard deviations is not
twice as large as the other, then the assumption of equal population standard deviations tends to
pose few problems. If the sample sizes are small, relatively different and one sample standard
deviation is many times larger than the other, then assuming equal population standard
deviations may be unwise. There are formal tests available in statistics to test the equality of
population standard deviations. These tests are a normal part of an applied statistician’s tool kit

and they certainly have their place in problem solving. However, using these tests as a precursor
to choosing which test for $H_0: \mu_1 = \mu_2$ will be implemented is not without controversy. For now,
without further study into various theoretical considerations, this practice is not recommended.
Instead, an informal consideration of the sample sizes, relative sizes of the sample standard
deviations, and examination of histograms (if possible) should collectively suffice for choosing
between the two sample t-test assuming equal standard deviations and the Welch approximate
procedure.

Concepts and Terms to Review from Case Question 3A

population
sample
variable of interest
parameter
statistic
null hypothesis
alternative hypothesis
test statistic
sampling distribution
p-value
standardizing
independent random variables
population variance
shift model
pooled standard deviation
weighted average
studentization formula for the difference in two sample means
Type I error
Type II error
significance level ($\alpha$)
short-cut formula for sample standard deviation
null distribution
two sample t-test
95% confidence interval for the difference in two population means
95% confident
Central Limit Theorem
Large Sample Test for the Difference in Population Means
Large Sample Confidence Interval for the Difference in Population Means
Welch approximate t-test

Case Study #4A: Revenue From Sugar Cane
Introduction

Variability is a key concept in statistical science. Statisticians are often interested in problems
that examine how one variable is affected in response to changes in another variable. That is, the
focus is on how two (or more) variables change “together” – how they co-vary. Measuring this
covariance might allow for one variable to be predicted by knowing the value of another.

In business, the bottom line is always profit or loss. Additionally, it is informative to address the
amount of profit or loss that could be expected in particular situations. In the business of sugar
cane farming, a chain of events seems clear: plant more sugar cane, harvest more sugar cane,
produce more sugar, sell more sugar … all to make more profit.

It makes sense that if a farmer can harvest more acreage of sugar cane, he stands to be able to
produce more sugar and increase his revenue. But statistically, if the farmer increases or
decreases the amount of acreage harvested, how much revenue would be expected to be gained
or lost? Can data on sugar cane harvested and the subsequent amount of sugar production be
used to create a statistical model? Can this model impact a farmer’s decision on how much cane
to harvest? Can such a model be useful in predicting revenue if one were to buy acreage for the
purpose of farming?

Case Question 4A

A farmer is considering devoting an extra 100 acres of his land to sugar cane and would like to
predict the additional yearly revenue from doing so. The data available are from 14 Louisiana
parishes. Records have been kept for each of these parishes regarding the amount of acreage
harvested and the amount of sugar produced during the last year. Economic indicators are such
that the prospective buyer can generate a revenue of 3 cents per pound of sugar produced. What
increase in revenue can be expected if the farmer devotes the 100 acres of land to sugar cane?

The data from the 14 Louisiana parishes is given below:


Parish:                  1    2    3    4    5    6    7    8    9   10   11   12   13   14
Harvested (100 acres): 337  152  144   23  302  131  296  202  338  205  331   80  411  179
Production (1000 tons): 940  460  440   65  830  380  860  590 1020  585 1020  200 1130  570

Population and Sample

The physical location where this case takes place is Louisiana. The focus is just on land in
Louisiana that is used for growing sugar cane. Therefore, a reasonable target population is all
land used for sugar cane farming in the state of Louisiana. The only data available to us in our
investigation regarding potential revenue are the data on harvested acreage and sugar production

in 14 parishes (a parish is similar to the notion of a county). Thus, the sample in Case 4A is the
14 parishes for which we have sugar cane farming information. We will assume that the 14
parishes constitute a random sample of sugar cane farm land in Louisiana. There is no direct
evidence in the problem to the contrary. On the basis of the information given, it is reasonable to
believe that the parishes sampled are representative of all locations in Louisiana used for sugar
cane production.

Random Variables of Interest

For each of the 14 parishes sampled, we have measurements on two variables. The fact that
there are two variables is one similarity with the data structure from Case 3A. However, in Case
4A, the data are naturally paired. We have observed the harvested acreage in a parish and for
that same parish we also have a measurement on the amount of sugar production. In addition,
answering Case Question 4A will require us to assess how much sugar should be expected from
an additional 100 acres of land. If we can obtain an estimate of how much sugar we should
expect 100 acres of land to produce, then we can obtain an estimate of the revenue.

So, the two sample data in Case 4A differs from the two sample data in Case 3A in several
critical ways:
•  The data in Case 4A are paired. Each parish sampled provides us with harvested acreage
   and sugar production data.
•  We want to use the paired data in order to predict sugar cane production from the amount
   of land that was harvested. We desire a statistical model that can be used to explain the
   relationship between the two variables.

Often, data is collected on multiple variables for each subject or unit under study (say, parishes).
When statisticians use data such as this to a) predict one variable from the outcomes of the other
variables or b) create a model that mathematically explains the relationships between the
variables, they are using correlation and regression methods. For us, we have two variables and
we would like to know how these two variables are related – how they correlate. Also, we
would like to predict sugar production from harvested acreage.

Typically, in structures such as this, the observed data are ordered pairs: $(x_i, y_i)$ for $i = 1, \ldots, n$.
Thus, the Louisiana sugar cane data could be represented as the 14 ordered pairs (337, 940), (152,
460), (144, 440), (23, 65), (302, 830), (131, 380), (296, 860), (202, 590), (338, 1020), (205, 585),
(331, 1020), (80, 200), (411, 1130) and (179, 570).

In a regression problem, the variable that is being predicted is called the dependent variable and
is typically denoted $Y_i$. The variable that is being used in order to predict the dependent variable
is denoted $X_i$ and is often referred to as the independent variable. Sometimes more than one
independent variable is used to predict a single dependent variable. This is called a multiple
regression problem since multiple independent variables are used. Case 4A is an example of
what is called a simple regression problem where the word “simple” is used to indicate use of a
single independent variable. Therefore, the variables of interest in Case 4A are:

•  $X$ = the amount of harvested acreage for the purpose of sugar cane farming in a
   Louisiana parish
•  $Y$ = the amount of sugar production on the sugar cane farms in a Louisiana parish.

These variables are observed together in pairs $(x_i, y_i)$ for $i = 1, \ldots, 14$. That is, our data collected
on these variables is comprised of 14 ordered pairs.

Scatterplots: A Graphical Descriptive Statistic Tool Used In Correlation & Regression Problems

Since the data are ordered pairs, it makes sense to plot the 14 ordered pairs as coordinates on an
$(x, y)$ grid. Doing this creates what is called a scatterplot. The goal of the scatterplot is similar
to the goal when drawing a histogram for a single variable: look for trends, shapes and critical
features of the data. We would like to “see” any relationship that might exist between X and Y
before we create a statistical model. Be careful not to contract or expand the x-axis and y-axis in
an artificial way when drawing scatterplots. Later we will see that the statistical model that is
created to describe the relationship between X and Y is only valid over the range of collected
data. So, it is not necessary to include values on the axes that are far from the observed x and y-
coordinates.

When the 14 ordered pairs in Case 4A are entered into the statistical software JMP, a scatterplot
can be produced. To do this, open a new Data Table (spreadsheet), type the values of the
harvested acreage in one column and the sugar produced in a second column. Each row in the
Data Table represents an ordered pair. Next, click on Analyze>Fit Y by X. Then click and drag
the variable for acreage into the “X, Factor” box. Likewise, click and drag the production
variable into the “Y, Response” box. Once this is done, under the Action Box, just click “OK”.
The following scatterplot is produced.

The data can also be entered into two columns in Excel and then highlighted with the cursor.
Once the data (both columns) are selected, the user can click on the “Insert” tab and then choose
“Scatter” in the list of plots. The first option under this palette is “Scatter with Only Markers”.
Clicking on this icon produces a scatterplot. Then, from the Excel menus, choosing Chart Tools
and going to the “Layout” tab will allow the user to label the axes and insert other options to
enhance the plot.
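
A scatterplot can also be produced outside of JMP or Excel. The following Python sketch (matplotlib is assumed to be available; it is not part of the course software) plots the same 14 ordered pairs:

    import matplotlib.pyplot as plt

    harvested = [337, 152, 144, 23, 302, 131, 296, 202, 338, 205, 331, 80, 411, 179]
    production = [940, 460, 440, 65, 830, 380, 860, 590, 1020, 585, 1020, 200, 1130, 570]

    plt.scatter(harvested, production)
    plt.xlabel("Harvested acreage (100s of acres)")
    plt.ylabel("Sugar production (1000s of tons)")
    plt.title("Sugar production vs. harvested acreage, 14 Louisiana parishes")
    plt.show()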

The scatterplot shows a definitive trend. As the amount of harvested acreage increases, so does
the sugar cane production. This was to be expected. However, what is interesting is that the
relationship between acreage and production appears to be linear. Statisticians use scatterplots
to investigate the type of relationship observed – if any – between the independent and dependent
variables. For Case 4A, it appears as though production increases linearly in acreage. Thus, we
will attempt to build a simple linear regression model to aid in the solution of Case Question
4A. Had the scatterplot shown a polynomial trend, we might have investigated a polynomial
regression model. Likewise, a scatterplot may show a logarithmic or exponential trend. The
observed trend then drives the type of regression model used. Then again, a scatterplot could
show a complete lack of trend. In these cases it could be that the X variable has little predictive
information in it concerning Y. If no trend is seen in the scatterplot, then X may not be able to
predict Y in a satisfactory way.

Even though the trend seen in our scatterplot is linear, notice that the ordered pairs do not all fall
exactly on one line. This should not concern us in the least. Even if there is an underlying
theoretical linear relationship between acreage and production, we know that real data contains
natural variation. That is, we must envision a true theoretical trend line governing the
acreage/production relationship and then our observed data clustering around this line. Natural
variation produces scatter around this theoretical trend line. The lower this variation is, the more
tightly our data will cluster around this theoretical line. If there is a large amount of sampling error
and/or natural variation, then the data may only cluster loosely around the theoretical trend line.

In our case, the data from the 14 Louisiana parishes clusters tightly. This tight clustering gives
the impression that sugar production and acreage are highly correlated. Since the slope of the
trend seen in the scatterplot is positive, we say that the association between our two variables
exhibits a positive correlation. When two variables are positively correlated, then as one
variable increases, the other tends to also increase. Similarly, as one variable decreases, the
other tends to decrease. This is true for the data in Case 4A.

If outcomes of one variable increase and this tends to be associated with a decrease in another
variable, we say that the variables are negatively correlated. If the relationship between these
two variables is still linear as could be seen in a scatterplot, then the slope of the underlying trend
line would be negative.

At this point, our scatterplot motivates two questions:

•  How can we measure the strength of the linear association we see between acreage and
   production?
•  How can we obtain the equation of a line that “best fits” the data?

The first question is addressed by a correlation coefficient. The second
question is answered through obtaining the least squares regression line fit to the observed
data.

Parameters of Interest

The two questions above hint at a set of parameters worth our investigation. Parameters are
numerical characteristics of a population. When considering all land used for sugar cane farming
in the state of Louisiana we can conceive of a theoretical correlation that exists between acreage
and production. We can also think of a theoretical linear relationship between the two variables.
This theoretical line pertains to all of the sugar cane farmland. Thinking in this abstract manner
allows us to state the following parameters of interest.

•  $\rho$ (rho) = the correlation coefficient between acreage harvested and sugar production on
   all land in Louisiana used for sugar cane farming. The parameter $\rho$ describes the
   strength of the linear association that exists between our two variables for all land in
   Louisiana used for sugar cane farming.
•  $\beta_0$ and $\beta_1$: the intercept and slope, respectively, of the theoretical trend line that exists
   between harvested acreage (X) and sugar production (Y). This trend line has equation
   $Y = \beta_0 + \beta_1 X$.

The above three parameters are unknown and must be estimated with statistics that can be
calculated from our observed data.

Since we acknowledge that observed data seen in nature will tend to exhibit scatter around any
theoretical trend line, statisticians study linear models that incorporate what are called error
terms. These error terms represent the natural scatter one would expect to observe around a
theoretical trend line when collecting data. Including an error term as an addition to the equation
of a line creates a linear statistical model which can be expressed as

$$ Y_i = \underbrace{\beta_0 + \beta_1 X_i}_{\text{signal}} + \underbrace{\varepsilon_i}_{\text{noise}}, $$

where $i = 1, \ldots, n$. This statistical model states that each observation of the dependent variable is
equal to a linear function of the independent variable plus random scatter. The theoretical trend
line represents the true underlying signal or trend that exists in nature. The error terms represent
the noise present in data due to sampling errors, bias and natural variation. The scatter around
the trend line is captured by this term in the linear model.

This model states that even if we observed the same value of X multiple times, we would see
different values of Y due to the inclusion of the error term $\varepsilon$. We would hope that among many
observations of the same X value, the average of all the errors would be 0. This is typically
assumed when working with linear statistical models. Although, it should be noted that such an
assumption – or any other type of assumption – is NOT required in order to answer the two
questions in the box above. All that is necessary to assume in order to answer the questions in the
box above is that we have obtained a random sample of ordered pairs. For purposes that we will see later, the

standard deviation in the random scatter of Y values that one would observe for any particular X
value is generally taken to be the same for all X values and denoted $\sigma$. If important to estimate,
this parameter could be added to the list above. The other three parameters can be estimated
without any knowledge or care about $\sigma$.

A Statistic Useful in the Study of Correlation

Recall, $\rho$ is the correlation coefficient between acreage harvested and sugar production on all
land in Louisiana used for sugar cane farming. The value of $\rho$ is unknown and must be
estimated using our observed ordered pairs. Recall that if we have one sample of data taken
from a quantitative variable, the sample standard deviation based on $x_1, \ldots, x_n$ is the square
root of

$$ s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2. $$

The value of $s^2$ is the sample variance. The statistics $s$ and $s^2$ are used to measure the
variability present in the data. For emphasis, we could write

$$ s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})(x_i - \bar{x}). $$
At present, we wish to describe the way in which the independent (X) and dependent (Y)
variables vary together. That is, we need not measure just the variability inherent in one variable
but the sample covariance between outcomes of two variables. To do this, we could alter the
expression above to incorporate deviations from the mean for the x-values and also the y-values:

$$ \frac{1}{n-1}\sum (x_i - \bar{x})(y_i - \bar{y}). $$

Now, it is really important to notice that this quantity will be positive when above average x-
values tend to be paired with above average y-values (a positive times a positive equals a
positive). This quantity will also be positive when below average x-values tend to be paired with
below average y-values (a negative times a negative equals a positive). So, if our data behaves in
this way, the covariance is positive. The data in Case 4A is an example of ordered pairs having a
positive covariance.

If above average x-values tend to be paired with below average y-values and vice versa, then the
covariance will be negative (a positive value times a negative value is a negative).
Thus, the covariance can use information in the data to estimate the direction of the linear
relationship between the two variables of interest. The most important thing about the sample
covariance is its sign – indicating the direction of the linear relationship.
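
The sample covariance expression above translates directly into code. Here is a minimal Python sketch (the function name is our own):

    from statistics import mean

    def sample_covariance(x, y):
        """Sample covariance: averaged products of deviations, with an n - 1 divisor."""
        xbar, ybar = mean(x), mean(y)
        return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (len(x) - 1)

Applied to the 14 acreage/production pairs of Case 4A, this function returns a positive value, agreeing with the upward trend seen in the scatterplot.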

A statistic which can estimate $\rho$ is obtained by dividing the sample covariance by $s_X s_Y$, the
product of the sample standard deviations. The resulting value is called the sample correlation

coefficient and is denoted $r$. The statistic $r$ is a point estimate for the parameter $\rho$ (notice that
“rho” starts with the letter “r”).

Sample Correlation Coefficient: The strength of the linear association between two quantitative
variables (X) and (Y) can be estimated using the observed ordered pairs $(x_i, y_i)$ for $i = 1, \ldots, n$.
The statistic that is most commonly used to do this is the sample correlation coefficient (r) and is
calculated by

$$ r = \frac{\frac{1}{n-1}\sum (x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}, $$

where $s_x$ is the standard deviation based on $x_1, \ldots, x_n$ and $s_y$ is the standard deviation based on
$y_1, \ldots, y_n$.

Both the population (theoretical) correlation coefficient $\rho$ and the sample correlation
coefficient $r$ must fall between -1 and +1. Values close to the boundaries of the interval $(-1, 1)$
indicate strong positive (near +1) or strong negative (near -1) association. This is reflected in a
tight clustering in the corresponding scatterplot of the data. For instance, a value of $r = .94$
would indicate a very strong linear relationship between two variables in the positive direction.
A value of $r = -.89$ would indicate a strong negative relationship between the variables. As the
value of $r$ moves farther away from the boundary points, the relationship would be characterized
as moderate and ultimately, weak. For instance, $r = .58$ may indicate a moderate positive
relationship between two variables and $r = -.27$ may indicate a weak negative relationship.
Values near zero indicate little to no linear relationship between the two variables in the
investigation.

The words “strong”, “moderate” and “weak” do not have any definitive cutoffs for
their use when speaking of correlations. In fact, adjectives such as these are very dependent on
the scenario in which the data appear. What a structural engineer may call a moderate linear
relationship, a psychologist may call strong. Researchers from different disciplines involved in
different types of correlation studies tend to dictate the appropriate use of these adjectives.

As with all statistical formulae involving standard deviations and/or variances, there exist “short-
cut” formulas. In addition to these “by-hand” shortcuts, we should certainly be willing to make
use of software. A computational shortcut formula for the sample correlation coefficient is

$$ r = \frac{\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)/n}{\sqrt{\left[\sum x_i^2 - \left(\sum x_i\right)^2/n\right]\left[\sum y_i^2 - \left(\sum y_i\right)^2/n\right]}} . $$

Once a scatterplot has been produced in JMP, it is simple to obtain the value of r from the
computer. We will see this later. Suppose the 14 x-values (harvested acres) are placed in an
Excel spreadsheet in cells A1 to A14 and the 14 corresponding y-values (sugar production) are
placed in cells B1 to B14. Then the Excel code

=CORREL(A1:A14,B1:B14)
gives the value of r. Doing this for the data of Case 4A produces r = .9939. By any standard,
this is a very strong positive correlation. Our scatterplot indicated a very strong linear trend
between harvested acreage and sugar production. The sample correlation coefficient has
quantified this association and tells us that the linear relationship is indeed very strong between
our two variables of interest. As we did the first time we encountered the sample standard
deviation, we present a table from which a by-hand calculation of r can be accomplished.
Because this process is lengthy, use of software is encouraged when working with even moderate
sample sizes.

xᵢ        yᵢ        xᵢ²        yᵢ²        xᵢyᵢ

337 940 113569 883600 316780


152 460 23104 211600 69920
144 440 20736 193600 63360
23 65 529 4225 1495
302 830 91204 688900 250660
131 380 17161 144400 49780
296 860 87616 739600 254560
202 590 40804 348100 119180
338 1020 114244 1040400 344760
205 585 42025 342225 119925
331 1020 109561 1040400 337620
80 200 6400 40000 16000
411 1130 168921 1276900 464430
179 570 32041 324900 102030

Σxᵢ = 3131    Σyᵢ = 9090    Σxᵢ² = 867915    Σyᵢ² = 7278850    Σxᵢyᵢ = 2510500

From this chart, we can now insert the appropriate totals along with n = 14 to obtain

$$ r = \frac{2510500 - (3131)(9090)/14}{\sqrt{\left[867915 - (3131)^2/14\right]\left[7278850 - (9090)^2/14\right]}} = \frac{477586.4286}{480501.5056} = .9939 . $$

Recall, our scatterplot motivated two questions:

 How can we measure the strength of the linear association we see between acreage and
production?
 How can we obtain the equation of a line that “best fits” the data?

At this point, we have answered the first of these questions with the sample correlation
coefficient, r. We now proceed to address the second question regarding a “best” line of fit to
the data in scatterplot. Along the way, we will discover that the answers to these two questions
are related.

The Line of Best Fit: What Do We Mean By “Best”?

Our task is now to choose a line to represent the trend in the data from Case 4A. If such a line
can be constructed, then the formula for this line could be used to predict sugar production based
on knowledge of how many acres were harvested. Additionally, the question from Case 4A
could be resolved. Remember that the slope of a line can be described as the ratio of the “rise”
in the y-variable to the “run” in the x-variable. So, the slope of our fitted line would represent
the expected rise in sugar production for a unit increase in harvested acreage. Since the “unit”
for acreage is 100 acres, the slope will represent the number of 1000-tons of sugar that we would
expect to produce for every 100 acres of harvested land. Once we know the number of tons
that we would expect to produce per every 100 acres of land, we can convert that figure to pounds
and multiply by 3 cents, since economic indicators are such that the prospective buyer can obtain a
revenue of 3 cents per pound of sugar produced.

The issue is “how” can we fit a line to the data? What do we mean by “best” fit? Since it only
takes two points to completely determine a line, there are all sorts of possible lines that could be
explored for the sugar cane data. One way to obtain a line is to just pick the two most extreme
points in the scatterplot and connect them. If we did this, then we could determine the equation
of the line that goes through the points (23,65) and (411, 1130). Doing this would be quick, but
is it “best”? For two reasons (at least), it appears this is not an acceptable method. First, such a
line utilizes only information from two of the 14 parishes. Second, the resulting line can be
shown to fall “under” virtually all the other data that was collected. That is, the remaining 12
data points tend to fall above this line and therefore our line connecting the extreme two data
points is misleading when considering the bulk of the data.

Another method could be to connect all possible pairs of points and find the slope of all the
resulting lines. We could connect the first ordered pair to the second ordered pair and find the
resulting slope. We could then move on to two other ordered pairs – fit the line that connects
them and obtain the slope. We could keep doing this until all possible pairs have been
connected and all possible pairwise slopes obtained. That would be a lot of work! But then,
after we had all possible pairs connected, we could obtain the average or median of all the slopes
we had created. This resulting average or median of all the pairwise slopes could then be called
the “best” slope. Actually, this is not that bad of an idea. Doing this will often create a
reasonable slope estimate for the trend of the data. But, this method is quite intense. Plus, we
have no formula of a line if we do this. We have only a slope. So, this method does not generate
the equation of a line. We would like to use the equation for prediction and strictly speaking this
method doesn’t provide such an equation. Returning to the intensity criticism: This method is
sometimes used, but realize that for n ordered pairs, the number of ways to pair up the points is

n n! n  n  1
   .
 2  2! n  2 ! 2

For n  14 , the number of ways to pair up the points is 91. That’s a lot of slopes to calculate.
For n  100 , the number of ways to pair up the points is 4950. This method appears to involve a
lot of tedious calculation effort.
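
To make the pairwise-slope idea concrete, here is a brief Python sketch (our own illustration, not a procedure used elsewhere in this manual) that computes all 91 pairwise slopes for the Case 4A data and takes their median; estimates of this type are often associated with the Theil–Sen approach.

import numpy as np
from itertools import combinations

x = np.array([337, 152, 144, 23, 302, 131, 296, 202, 338, 205, 331, 80, 411, 179])
y = np.array([940, 460, 440, 65, 830, 380, 860, 590, 1020, 585, 1020, 200, 1130, 570])

# Slope of the line connecting each pair of points (91 pairs when n = 14)
slopes = [(y[j] - y[i]) / (x[j] - x[i]) for i, j in combinations(range(len(x)), 2)]
print(len(slopes))          # 91
print(np.median(slopes))    # for data this strongly linear, typically close to the least squares slope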

No matter what line is used to represent the trend seen in the scatterplot, the line will not be able
to go through all of the ordered pairs. The 14 data points do not all lie on a line, so any trend line
used will “miss” most or all of the points. Statisticians call the vertical distance by which a line
misses an ordered pair a residual. Clearly small residuals indicate that the trend line is close to
the actual ordered pair from the data. So, it makes sense to fit a line to our data that has small
residuals. To find the “best” line we could try to make these residuals as small as possible.

There is another reasonable requirement for a line of best fit. It makes sense to predict the
average harvested acreage to have the average amount of sugar production. That is, it seems
logical to require our line of best fit to pass through the point of averages (x̄, ȳ). Thus, we
would prefer a line that passes through the point of averages while also minimizing how much
the line “misses” the data points – that is, minimizing residuals. These are not unreasonable
requirements, but there is one subtle additional fact to realize. Any non-vertical line that passes
through the point of averages will have the sum of its residuals be zero. This is because it can be
shown that the positive residuals will cancel out the negative residuals. Said another way,
sometimes our line that goes through (x̄, ȳ) will overshoot the data and at other times it will
undershoot the data. The total amount our line overshoots the data will equal the total amount
that our line undershoots the data. So, the grand total of all the residuals will be zero provided
we pass the line through (x̄, ȳ).

Suppose a set of data shows a linear trend on a scatterplot and we pass a line through the ordered
pairs and the resulting residual for the 4th data point is the value e₄ = 3. Then, scanning the
remainder of the data we notice that the residual for the 7th data point is the value e₇ = −3. In
terms of accuracy, our line “misses” these two points by the same amount. In one case we
overshoot and in the other we undershoot. The net effect is cancellation. But, if we looked at
squared residuals, then each data point still contributes to the solution of finding a “best” fit
equally since e₄² = e₇² = 9. Thus, the cancellation problem is solved by attempting to minimize
the total of the squared residuals instead of minimizing the total of the residuals (which must
always be zero if we pass through (x̄, ȳ)).

Dealing with squared residuals is not a completely foreign concept to us. After all, when
defining the standard deviation of a sample we used the term Σ(xᵢ − x̄)². We called this the
sum of the squared deviations back when looking at numerical descriptive statistics for a
quantitative variable. The concept of squared residuals is similar.

The Line of Best Fit: Method of Least Squares

Capitalizing on the argument presented in the previous section, we define the line of best fit as
the line which minimizes the grand total of the squared residuals. Mathematics can show that the
line which accomplishes this goal will pass through (x̄, ȳ). If the residual associated with the
i-th data point is denoted eᵢ, i = 1, ..., n, then the method of least squares finds the equation of the
line which makes Σeᵢ² as small as it could possibly be for the given data set. Obtaining the
equation of the line which minimizes the squared residuals is a calculus problem. In truth, it is
not a particularly hard calculus problem and it is one for which mathematicians and statisticians
have developed tried and true methods. This calculus problem has a solution that can be
presented in simple form.

Method of Least Squares For Simple Linear Regression: Suppose an experimenter has collected
n ordered pairs represented by (xᵢ, yᵢ), i = 1, ..., n. The line which minimizes the grand total of the
squared residuals is called the least squares regression line and this line has slope β̂₁ = r·s_y/s_x
and intercept β̂₀ = ȳ − β̂₁x̄. Here, r is the sample correlation coefficient, s_x is the standard
deviation of the values x₁, ..., xₙ and s_y is the standard deviation of the values y₁, ..., yₙ.
Therefore, the formula for the least squares regression line is y = β̂₀ + β̂₁x.

Thinking of the slope as “rise over run”, we see that the least squares regression line predicts that
if we increase the x-variable by one standard deviation, the associated rise or increase in the y-
variable should NOT be a full standard deviation in the y-direction. Instead, the predicted rise in
the y-variable is only r times a standard deviation in the y-direction. Recall, −1 ≤ r ≤ 1 so that
|r|·s_y ≤ s_y. This fact that an increase by s_x is NOT predicted to cause an increase in the y-
direction of s_y, but instead only r·s_y, is called the regression effect. Frequently, scientists use
the phrase “regression to the mean”. Investigators using this term are referring to the regression
effect.

The above formula for the slope of the least squares regression line makes it clear that the study
of correlation and regression are linked. The slope of the line of best fit is driven by the
correlation present in the data. Statistical software such as JMP and spreadsheet packages such
as Excel can easily compute the formula for the least squares regression line. In fact, if a
scatterplot has already been produced in Excel, the least squares regression line can be added to
the plot by right-clicking on any data point and then selecting “Add Trendline”. Next, select the
“Linear” button and also place a check in the box marked “Display Equation on Chart”. When
this is done for the data in Case 4A, Excel produces the graph at the top of the next page. The
reader should notice that axis labels and a title were included by using the Chart Tools tab.

Looking at the Excel output, we see that the line of best fit has intercept β̂₀ = 12.341 and slope
β̂₁ = 2.848. To check that this slope is indeed correct, recall that we have already calculated the
sample correlation coefficient for the sugar cane data to be r = .9939. The reader should return
to the table facilitating this calculation and use the information presented there to confirm
s_x = 113.57 and s_y = 325.44. Now, we can utilize the formula for the slope of the line of best fit
to conclude β̂₁ = r·s_y/s_x = .9939(325.44/113.57) = 2.848.
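
The same verification can be scripted. The Python sketch below (again outside the course's Excel/JMP workflow) works only from the column totals in the by-hand table, recovering r, s_x, s_y, and then the least squares slope and intercept.

# Totals from the by-hand table for Case 4A
n = 14
sum_x, sum_y = 3131, 9090
sum_x2, sum_y2, sum_xy = 867915, 7278850, 2510500

Sxx = sum_x2 - sum_x**2 / n
Syy = sum_y2 - sum_y**2 / n
Sxy = sum_xy - sum_x * sum_y / n

r = Sxy / (Sxx * Syy) ** 0.5          # about 0.9939
s_x = (Sxx / (n - 1)) ** 0.5          # about 113.57
s_y = (Syy / (n - 1)) ** 0.5          # about 325.44

slope = r * s_y / s_x                       # about 2.848
intercept = sum_y / n - slope * sum_x / n   # about 12.341
print(r, s_x, s_y, slope, intercept)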

[Excel scatterplot for Case 4A: Sugar Cane Production. Vertical axis: Sugar Produced (1000 Tons); horizontal axis: Harvested Acreage (100 Acres). The displayed trendline equation is y = 2.848x + 12.341.]

The formula for the least squares regression line is y = 12.341 + 2.848x. This equation can be
used to predict the sugar production (y) for any value of harvested acreage (x) that might be of
interest. When doing this, we generally denote the prediction by the notation ŷ. For instance,
suppose a parish contains 28,000 acres which can be harvested. Then the sugar production in
this parish could be predicted by inserting x = 280 (notice 28,000 acres = 280 units of “100 acres”) into the
formula for the least squares regression line. Doing this produces

ŷ = 12.341 + 2.848(280) = 809.781.

Since the unit for sugar produced is “1000 tons”, we conclude that our prediction for this parish
is 809,781 tons of sugar (809,781 tons = 809.781 × 1000 tons). One should be careful not to use the
line of best fit for x-values that are beyond the range of the collected data. This is called
extrapolation and it can be potentially quite misleading. We have evidence that there is a linear
trend when considering “100 acreage” values between 23 and 411. To use the least squares
regression line to predict at an x-value outside the range of 23 to 411 is to assume that the trend we
see extends to other acreage values. This is an assumption which is not justified based solely on
our data. After all, we have no data outside the range in our sample. Our inferences are
restricted to be over the range of data we observed.
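
As an illustration of using the fitted line for prediction while respecting the range of the data, here is a small Python sketch; the guard against extrapolation is our own addition, and the coefficients are those estimated above.

# Least squares fit from the Case 4A data (x in 100-acre units, prediction in 1000-ton units)
INTERCEPT, SLOPE = 12.341, 2.848
X_MIN, X_MAX = 23, 411   # range of harvested acreage observed in the sample

def predict_production(x_value):
    """Predict sugar production, refusing to extrapolate beyond the observed x range."""
    if not (X_MIN <= x_value <= X_MAX):
        raise ValueError("x is outside the observed range; prediction would be extrapolation")
    return INTERCEPT + SLOPE * x_value

print(predict_production(280))   # about 809.78, i.e., roughly 809,781 tons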

The least squares regression line can be obtained in JMP by following the previous instructions
for creating a scatterplot. Next, click the red triangle and select “Fit Line”. The line of best fit is
automatically added to the scatterplot and the formula for the line is printed below the label
“Linear Fit” on the output. Right below this on the JMP output is a section titled “Summary of
Fit”. The first line in this summary states that r 2  .987903 . This statistic is called the

96 | P a g e
coefficient of determination and it is the square of the sample correlation coefficient. Taking
the square root of .987903 produces r  .9939 which we have seen from Excel and by hand
previously. Roughly, the coefficient of determination measures the quality of the least squares
regression line for our data. Interpretation of the coefficient of determination can be tricky. This
is best left for a course in regression methods. Generally, the larger the value of r 2 , the “better”
our independent variable is for the purpose of explaining or predicting the outcome of our
dependent variable.

A Confidence Interval for the Slope

The slope of the theoretical trend line, β₁, is a parameter. We have estimated this parameter by
the method of least squares. Based on the ordered pairs (xᵢ, yᵢ), i = 1, ..., n, the least squares
estimate of the slope, β̂₁ = r·s_y/s_x, minimizes the sum of the squared residuals. We have
adopted the act of minimizing squared residuals as our interpretation of the phrase “best fit”. In
all previous case studies, when parameters have been estimated, we have seen it is important to
provide a measure of uncertainty for this estimate. When dealing with data collected from one
quantitative variable, we estimated the population mean μ with the sample mean X̄. However,
a confidence interval was provided. This range of guesses is more informative than the single
point estimate. In the same way, a confidence interval for the slope (β₁) can be derived.

Recall the linear statistical model Yᵢ = β₀ + β₁Xᵢ + εᵢ where i = 1, ..., n. The term εᵢ represents the
scatter that we expect to observe around the line for any particular value of the x-variable.
Obtaining the sample correlation coefficient and the least squares estimates for the slope and
intercept of the line of best fit do not require any assumption or structure to be placed on the error
term εᵢ. However, to construct a confidence interval or hypothesis test for a parameter requires
that we know the distribution of a statistic – not just the statistic itself. If we want to construct a
confidence interval for β₁, we will need to make an assumption regarding the error term εᵢ.

Previously, we stated that one would hope that among many observations of the same X value,
the average of all the errors would be 0. Additionally we mentioned that this is typically
assumed when working with linear statistical models. In addition to this assumption, we now
consider that these errors have a normal distribution with the same standard deviation for any X-
value in the range over which the data were taken. If many observations of the y-variable are
taken for the same value of x, then we would expect natural variation in these y-values. Some of
these y-values would yield positive residuals and others would be linked to negative residuals.
However, we now imagine that if these residuals were plotted in a histogram, they would
appear to have come from a normally distributed population. This normal population is assumed
to have mean 0 (on average we “hit” the line) and standard deviation σ. Finally, the value of σ
is presumed to remain constant if we investigate the residuals for a different value of the x-
variable. All of the above assumptions are typical when performing statistical inference for
simple linear regression models. These assumptions are summarized below.

Assumptions Generating a Confidence Interval for the Slope Parameter: In order to facilitate
statistical inference on the slope parameter in the linear statistical model Yᵢ = β₀ + β₁Xᵢ + εᵢ, we
assume the error term εᵢ has a normal distribution with mean 0 and standard deviation σ for all
i = 1, ..., n.

This set of assumptions generates a fourth parameter of interest. We have already estimated the
population correlation coefficient ρ and the intercept and slope parameters, β₀ and β₁. The
parameter σ measures the amount of spread around the trend line for any particular value of the
x-variable. When σ is small, the ordered pairs will tightly cluster. When σ is large, the data
will be more dispersed around the trend line. Clearly, the value of σ is related to the value of ρ.
The value of σ can be estimated by the value

$$ \hat{\sigma} = \sqrt{\frac{\sum e_i^2}{n-2}} , $$

where eᵢ, i = 1, ..., n, is the value of the i-th residual when using the least squares regression line.
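
A short Python sketch (our own check, not a required part of the course) computes the residuals from the least squares fit of the Case 4A data and then σ̂; it also computes the quantity σ̂/√(Σ(xᵢ − x̄)²), which appears below as the standard error of the slope reported by JMP.

import numpy as np

x = np.array([337, 152, 144, 23, 302, 131, 296, 202, 338, 205, 331, 80, 411, 179])
y = np.array([940, 460, 440, 65, 830, 380, 860, 590, 1020, 585, 1020, 200, 1130, 570])

# np.polyfit(x, y, 1) returns the least squares [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

n = len(x)
sigma_hat = np.sqrt(np.sum(residuals**2) / (n - 2))         # about 37.26
se_slope = sigma_hat / np.sqrt(np.sum((x - x.mean())**2))   # about 0.0910
print(sigma_hat, se_slope)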

In the past, we have been able to carry out statistical analysis on parameters by making use of a
theorem which begins the process. Statisticians studying regression have been able to establish
such theorems in the regression setting. The result below is what allows for the creation of a
confidence interval (or hypothesis test) for the slope parameter 1 .

Theorem: Suppose a set of ordered pairs (xᵢ, yᵢ), i = 1, ..., n, are observed from the linear statistical
model Yᵢ = β₀ + β₁Xᵢ + εᵢ. Further, suppose the assumptions required for generating a
confidence interval for the slope parameter given above are satisfied. Then, the statistic β̂₁ has a
normal distribution with mean β₁ and standard deviation

$$ \frac{\sigma}{\sqrt{\sum (x_i - \bar{x})^2}} . $$

In past investigations, we have learned how to standardize statistics that have a normal
distribution. For instance, we standardized X̄ in Case 2A and standardized X̄₁ − X̄₂ in Case 3A.
After this standardization, we realized that the standard deviation term involved an unknown
value which needed to be replaced (estimated) by a sample quantity. Doing this produced a
studentized form of the statistic which had a t distribution in each instance. This strategy should
be repeated again on the basis of the above theorem. Again under all the necessary assumptions,
the theorem tells us that

$$ Z = \frac{\hat{\beta}_1 - \beta_1}{\sigma \big/ \sqrt{\sum (x_i - \bar{x})^2}} $$

has a standard normal distribution. The value of σ is unknown, but we have stated above that it
can be estimated by σ̂. Thus, as has been done several times in earlier cases, we can construct a
t-statistic:

$$ t = \frac{\hat{\beta}_1 - \beta_1}{\hat{\sigma} \big/ \sqrt{\sum (x_i - \bar{x})^2}} . $$
There are n ordered pairs that were used to form the slope estimate β̂₁. So, our sample size
(sample of ordered pairs) is n. For this reason, the reader might guess that the appropriate
degrees of freedom for this statistic is n − 1. However, in computing the residuals that comprise
σ̂, we will use the formula for the line of best fit. This formula uses both the slope AND
intercept estimate. That is, two parameters were estimated in order to calculate the residuals –
the two estimates which form the basis for the equation of our “best fit”. Since two parameters
were estimated, the degrees of freedom for the above t-statistic is n − 2.

At this point, we have a statistic and a sampling distribution for the statistic. Thus, we can repeat
the probability statements and algebra required to construct a confidence interval. For a
reminder of this see the section titled “Confidence Interval For a Population Mean Based on the
One-Sample t-test” in Case 2A or the section labeled “Confidence Interval For the Difference in
Two Population Means Based on the Two-Sample t-test” in Case 3A. Applying this same type
of work to our current regression setting produces the following.

Confidence Interval For the Slope: Suppose a set of ordered pairs (xᵢ, yᵢ), i = 1, ..., n, are observed
from the linear statistical model Yᵢ = β₀ + β₁Xᵢ + εᵢ. Further, suppose the assumptions required
for generating a confidence interval for the slope parameter given above are satisfied. Then, the
100(1 − α)% confidence interval for the slope β₁ is

$$ \hat{\beta}_1 \pm t_{\alpha/2,\, n-2}\, \frac{\hat{\sigma}}{\sqrt{\sum (x_i - \bar{x})^2}} . $$
As with virtually all calculations associated with correlation and regression, statistical software
and/or Excel can facilitate making the calculations. Previously, we have discussed how the least
squares regression line can be calculated using JMP. Recall, after the scatterplot is printed by
JMP (using the “Fit Y By X” menu), the user can click on the red triangle to select “Fit Line”.
The resulting output from JMP is shown on the following page. JMP gives us a lot to look at,
but we see the scatterplot with the line of best fit superimposed over the data. Additionally,
under the heading “Linear Fit” we see the formula for the least squares regression line. At the
bottom of the output, there is a section titled “Parameter Estimates”. Notice that in the “Acres”
row, the value of the “Std Error” is 0.090977. This is the value of σ̂/√(Σ(xᵢ − x̄)²). So, JMP
has given us enough information to deduce that the 100(1 − α)% confidence interval for the
slope β₁ is

$$ 2.848 \pm t_{\alpha/2,\,12}\,(.090977) . $$

Notice that the degrees of freedom associated with our current confidence interval problem is
n − 2 = 12. If we desire a 95% confidence interval, we can use t-tables or Excel to conclude that
Bivariate Fit of Production (1000 tons) By Acres (100)

Linear Fit
Production (1000 tons) = 12.340793 + 2.848045*Acres (100)

Summary of Fit
RSquare 0.987903
RSquare Adj 0.986895
Root Mean Square Error 37.255
Mean of Response 649.2857
Observations (or Sum Wgts) 14

Analysis of Variance
Source DF Sum of Squares Mean Square F Ratio
Model 1 1360187.6 1360188 980.0084
Error 12 16655.2 1388 Prob > F
C. Total 13 1376842.9 <.0001*

Parameter Estimates
Term Estimate Std Error t Ratio Prob>|t|
Intercept 12.340793 22.652 0.54 0.5959
Acres (100) 2.848045 0.090977 31.31 <.0001*

the correct constant to use in the confidence interval formula is t.025,12 = 2.179. Thus, our
confidence interval is 2.848 ± (2.179)(.090977), or (2.650, 3.046). A reasonable range of
guesses for the slope of the theoretical trend line for the Louisiana sugar cane problem is
anywhere from 2.650 to 3.046.
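
The same interval can be reproduced with a few lines of Python, using scipy only to supply the t critical value; this is simply a scripted version of the hand calculation above.

from scipy import stats

slope_hat = 2.848045     # least squares slope from the JMP output
se_slope = 0.090977      # standard error of the slope from the JMP output
df = 12                  # n - 2 with n = 14

t_crit = stats.t.ppf(0.975, df)     # about 2.179
lower = slope_hat - t_crit * se_slope
upper = slope_hat + t_crit * se_slope
print(lower, upper)                 # about (2.650, 3.046)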

It is possible to obtain the values for the confidence interval directly from JMP rather than
having to piece the solution together as is done above. If the data are entered in a JMP data table
and then the user clicks on Analyze>Fit Model rather than Analyze> Fit Y by X, a confidence
interval for the slope can be output to the screen. After clicking on Fit Model, then the
Production variable should be placed in “Y Role” and the Acreage variable should be clicked on

and “Added” to the Model Effects Box. Then, click “Run”. At this point, JMP outputs even
more information than was seen on the previous page. To see the confidence interval for the
slope, click the red triangle in the upper leftmost part of the output near the word “Response” and finally
click on Regression Reports>Show All Confidence Intervals. The confidence interval is now
printed to the screen under the section marked “Parameter Estimates”. This portion of the output
is shown below. See the “Lower 95%” and “Upper 95%” values in the last two columns.

Parameter Estimates
Term Estimate Std Error t Ratio Prob>|t| Lower 95% Upper 95%
Intercept 12.340793 22.652 0.54 0.5959 -37.01367 61.695259
Acres (100) 2.848045 0.090977 31.31 <.0001* 2.649823 3.046267

Case Question 4A Concluding Decision

A scatterplot of the Louisiana sugar cane data from 14 parishes shows a strong positive linear
association. Having seen this, the sample correlation coefficient is calculated to be r = .9939.
Continuing, it seems reasonable to obtain a line of best fit for the Case 4A data. This was
accomplished using the method of least squares. The resulting least squares regression line has
formula y = 12.341 + 2.848x, where x represents Acreage (100 Acre Unit) and y represents Sugar
Production (1000 Ton Unit). The slope of the regression line is interpreted as the expected rise
in sugar production for every one unit increase in acreage harvested. One “unit” for the acreage
variable is 100 acres. So, the estimated slope tells us to expect a rise of 2.848 units for the
Production variable for every 100 acres of farm land harvested. The units for Sugar Production
are 1000 tons, so the estimated rise in production is (2.848)(1000 tons) = 2848 tons.

So, if the prospective buyer in Case 4A purchases 100 acres of sugar cane farm land, he can
expect a yield of 2848 tons of sugar in one year. Recall that he can generate a revenue of 3 cents
per pound of sugar produced. We need to remember that 1 ton = 2000 pounds. So, 2848 tons is
the same as 5,696,000 pounds. Each pound is worth 3 cents. So, putting this all together, his
estimate of revenue would be (5,696,000)(.03) = $170,880.

However, we also made some regression assumptions regarding the normality and standard
deviation of the resulting residuals from the regression fit. These assumptions created a 95%
confidence interval for the slope which was (2.650, 3.046). The bounds of this confidence
interval convert to a range of guesses from 2650 to 3046 tons of sugar produced yearly. Again,
using the conversion that 1 ton = 2000 pounds and each pound is worth 3 cents tells us that the
range of guesses for revenue is $159,000 to $182,760. All of the above is based on three digit
accuracy in estimating the slope and the endpoints of the confidence interval for the slope.

So, is it reasonable to conjecture that the prospective buyer could generate an average revenue of
$200,000 a year for his sugar cane farming efforts on his 100 acres? No, this doesn’t seem
likely. The range of reasonable guesses for his revenue does not include a value this high.
However, an average revenue of $175,000 is more likely. This is a reasonable guess for his

annual revenue since it falls in the 95% confidence interval. Realize, based on the data collected,
the best single estimate of his annual revenue is $170,880. On the flip side, if he required a
revenue of $150,000 per year in order to make a profit from sugar cane farming, he should feel
encouraged. The confidence interval tells us that our range of guesses lies entirely above this
value. The revenue expected if the buyer devotes the 100 acres of land to sugar cane farming is
$170,880. A confidence interval gives us reasonable bounds on this estimate that extend from
$159,000 to $182,760.

The data analysis in Case 4A was exclusive to revenue. As mentioned in the introduction to
Case 4A, we are ultimately interested in profit. The prospective buyer of the land has the
potential to use the information obtained here and compare it to any analysis of his expected
costs. He will have a cost for acquiring the land. He will also incur a cost for the seeds or
cuttings that will be planted in the ground. Of course, irrigation is costly and so possibly is labor
required to assist with managing the land. Treatment of the soil and/or fertilization are other
potential costs to the farmer. All of these variables could be studied using the data available to
him. Once costs have been projected in some useable way, he could couple those cost estimates
together with our revenue study to assess profitability. As time went on, the revenue study could
be updated with any additional data that emerges. The same is true with the assessment of cost.
Thus, profitability could continually be investigated through time and changes in the market
incorporated. Many statistical studies are performed by business personnel in order to estimate
or project revenue, cost - and ultimately, profit.

Statistical Follow Up to Case 4A: Hypothesis Test for The Slope

As long as we continue to make the assumptions which generate a confidence interval for the slope
parameter, we can construct a hypothesis test if needed. In fact, all of the mathematical work to
accomplish this is completed. In order to test H₀: β₁ = β₁* versus H_A: β₁ ≠ β₁*, we can use the t-
statistic obtained from our previous theorem. Here, β₁* is any conjectured value of the
theoretical slope which would be of interest to investigate. The choice of β₁* = 0 would be a test
of whether or not the independent variable has any predictive power in regards to the dependent
variable. If we do not reject β₁ = 0, then we are stating that it is believable that Y = β₀ + ε. That
is to say, the outcomes of the y-variable just tend to scatter around the constant β₀ and are not
linearly related to X at all. The experimenter does not have to choose β₁* = 0 when performing
the test. Statistically, the procedure is valid for whatever value of the slope is of interest.

To test H₀: β₁ = β₁* versus H_A: β₁ ≠ β₁*, make the regression assumptions discussed earlier and
then use the test statistic

$$ t = \frac{\hat{\beta}_1 - \beta_1^*}{\hat{\sigma} \big/ \sqrt{\sum (x_i - \bar{x})^2}} . $$

Calculate the p-value of the test by using the t distribution with n − 2 degrees of freedom. One
sided alternatives use the same test statistic and the same null distribution. The appropriate p-

values are then calculated from exclusively the lower or upper tail of the t distribution on the
basis of the direction prescribed by H A . As with confidence intervals, these tests can all be done
with statistical software like JMP.
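
For the Case 4A data, the test of H₀: β₁ = 0 can be scripted as follows. This Python sketch parallels the JMP output; the estimate and standard error are taken from that output, and the two-sided p-value matches the “Prob>|t|” column.

from scipy import stats

slope_hat = 2.848045
se_slope = 0.090977
df = 12

t_stat = (slope_hat - 0) / se_slope          # about 31.31, matching the JMP t Ratio
p_value = 2 * stats.t.sf(abs(t_stat), df)    # two-sided p-value, effectively 0 (reported as <.0001)
print(t_stat, p_value)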

Other statistical inference procedures are possible to derive in the simple linear regression
scenario that are not presented here. For instance, it is possible to obtain confidence intervals
and hypothesis tests for the intercept parameter. At times this is constructive, but at other times
the independent variable is not observed near zero and so such an investigation is meaningless.
It is also possible to obtain a confidence interval for the expected (or average) outcome of the y-
variable for a particular value of x. In addition, what are called prediction intervals are possible
in order to estimate future y-values for any particular value of x. When these confidence
intervals for the expected value of Y or prediction intervals for future values of Y are done for
many values of the x-variable, the result is a confidence band for the expected value of Y or a
prediction band for a future outcome of y. These are shown with dashed lines in the figure
below for the Louisiana sugar cane data. The set of bands closest to the line of best fit are the
confidence bands for the expected outcome of Y. The outer set of bands are the prediction bands
for future values of y at any particular x-value.

Finally, recall one of our assumptions regarding statistical inference in simple linear regression
involves the normality of the residuals. The residuals from the least squares regression line for
the data of Case 4A can be plotted in a histogram. See the figure below. The 14 residuals for the
Louisiana sugar cane data show no obvious departure from normality. This is at least one visual
check regarding the assumptions generating a confidence interval or test for the slope. There
are other diagnostics that researchers using regression methods can employ. These regression

diagnostics as well as the equations for the other procedures alluded to above are best relegated
to future study of correlation and regression and so further detail is not provided here. Creating
plots of residuals, formulas for additional confidence intervals, procedures for drawing
prediction bands, and other regression techniques are readily available in other textbooks. Entire
courses can be taught on the subject of regression analysis in which all of the material hinted at
here is covered along with many additional data analysis strategies.

[Figure: histogram of the residuals for Production (1000 tons)]
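
For readers working outside JMP, a histogram like the one referenced above can be produced with a few lines of Python using matplotlib; this is only one informal way to eyeball the normality assumption and is not a required part of the course.

import numpy as np
import matplotlib.pyplot as plt

x = np.array([337, 152, 144, 23, 302, 131, 296, 202, 338, 205, 331, 80, 411, 179])
y = np.array([940, 460, 440, 65, 830, 380, 860, 590, 1020, 585, 1020, 200, 1130, 570])

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

plt.hist(residuals, bins=6, edgecolor="black")
plt.xlabel("Residual (1000 tons)")
plt.title("Residuals: Production (1000 tons)")
plt.show()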

Concepts and Terms To Review from Case Question 4A:

covariance
population
sample
paired data
correlation
regression
dependent variable
independent variable
multiple regression problem
simple regression problem
scatterplot
linear trend
simple linear regression model
positive correlation
negative correlation
correlation coefficient
least squares regression line
parameter
statistic
rho (ρ)
slope
intercept
error term in linear model
sample covariance
sample correlation coefficient (r)
computational shortcut for the sample correlation coefficient
residual
method of least squares
formula for least squares regression line
regression effect
extrapolation
coefficient of determination
assumptions concerning a confidence interval for the slope
confidence interval for the slope
test statistic for hypothesis test on slope
null hypothesis
alternative hypothesis
p-value
confidence band for expected value
prediction band for future value

Case Study #5A: Should You Admit Your Guilt?
Introduction

An inevitable scenario playing out between parent and child is that of the youngster unknowingly
trapped in their own wrongdoing. Confronted by the parent, the child is faced with telling a
falsehood in the hopes of “getting away with it” or telling the truth and possibly receiving lighter
punishment. All the while, the parent knows full well the guilt of the child and waits to hear
either truth or lie. When only adults are involved and the stakes are higher – say, going to the
state penitentiary or not – the issue remains of whether it pays to admit one’s guilt. If one has
little hope of “getting away with it”, will a criminal receive a lighter punishment if they admit
their guilt and avoid a trial? Should a lawyer counsel a client facing damning evidence to admit
their guilt and hope for mercy? Then again, does it even matter? The data set explored in this
case looks at two groups of people. One group admits their guilt and the second group is found
to be guilty after a trial. Our issue is the severity of the sentence they receive. We will
investigate whether those who fess up in the first place obtain lighter sentences among the
population under study.

Case Question 5A

Over approximately a one year period, data on 255 people accused of robbery in the greater San
Francisco area is examined. All 255 people in the study have a lengthy criminal history and face
the possibility of being sent to California state prison. The evidence against the accused was
thought to be quite substantial in each of the 255 cases. Everyone in this study was determined
to be guilty in one of two ways: by their admission or by trial. Of the 255 people, 191 decided
to plead guilty and avoid a trial altogether. In some sense, these criminals wish to simply
a prison sentence. The hope among these 191 people is that they will be granted some form of
mercy – avoiding state prison – by admitting their guilt up front. The other 64 people in the
study went to trial and were found guilty. Of the 191 admitting their guilt, 101 were sentenced to
state prison. Of the 64 going to trial, 56 wound up in prison. Is this evidence that admission of
guilt results in a lesser chance of being sentenced to state prison?

Populations and Samples

Similar to Case 3A, there are actually two populations of interest in Case 5A. Similar to Case 1A
and 1B, our current problem deals with proportions. The two groups being compared in the
current problem differ in how guilt is determined. For some, guilt is determined by their own
admission. For others, they have plead innocent, but then are found guilty through the process of
having a trial. This is the only real difference in the two populations. All people being studied
are from San Francisco, have an extensive prior record, and have a large amount of evidence
mounted against them in regards to their robbery accusation.

It is important to realize at this point that there were more than 255 people with prior records
accused of robbery in San Francisco over the study period. We have information regarding
admission of guilt and eventual sentence on only 255 people – a fraction of the actual robbery
cases. Therefore, we will define the two populations as follows. The first population consists of

all people with criminal records that were accused of robbery in the greater San Francisco area
and admitted their guilt. The data presented in Case Question 5A focuses on a part of this
population. The 191 people in the case who admitted their guilt do not represent the entire first
population. There were other people in the time frame and geography that admitted to robbery.
However, we don’t know how many and we don’t know their eventual fate as it pertains to
prison. Nevertheless, we want to use the first sample of 191 people that we do have information
on to make a judgment about the entire first population.

The second population being studied consists of all people with criminal records that were
accused of robbery in the greater San Francisco area and were found guilty by trial. We only
have 64 members of this population to form the second sample. Over the year’s time frame,
there were others that were accused of robbery with prior records and went to trial. Many of
these people were surely found guilty and potentially, some of them sent to prison. We don’t
know how many, nor their eventual fate as it pertains to prison. We only know that of the 64
who were found guilty, 56 were sentenced to prison. Again, we want to use this second sample
of 64 people that we do have information on to make a judgment about the entire second
population.

To summarize:

Population 1: All accused of robbery in the San Francisco area with a prior record and plead
guilty.

Population 2: All accused of robbery in the San Francisco area with a prior record and plead
innocent, but were found guilty by trial.

Sample 1: The 191 people in the study that plead guilty to robbery. (n₁ = 191)
Sample 2: The 64 people in the study that were found guilty by trial. (n₂ = 64)

Random Variables, Parameters and Statistics of Interest

Remember from Case 1A and 1B that each person being investigated in the postponement of
death theory scenario was assigned a category of “yes” or “no”. In that case study, the person in
question either did or did not die within a window of three months of their birthday. The
situation is similar now. In Case 5A, each person being investigated either did or did not get
sentenced to state prison. Thus, the data in each sample can be viewed as the results of Bernoulli
trials.

Recall that a random variable is an assignment – one that obeys the laws of a function – in
which each experimental result is assigned a meaningful number. In the case of a Bernoulli
random variable, we typically take these meaningful numbers to either be 1 (success) or 0
(failure). Since we have two samples representing those that admitted guilt and those that were
found guilty, we will define two sets of Bernoulli random variables. The letter “X” will
correspond to the first sample – those who admitted guilt. Likewise, the letter “Y” will be used
for those that were found guilty.

We have 191 people in the first sample and so we define

$$ X_i = \begin{cases} 1 & \text{if the } i\text{th person admitting their guilt is sent to prison} \\ 0 & \text{if the } i\text{th person admitting their guilt is NOT sent to prison,} \end{cases} $$

i = 1, ..., 191. We have 64 people in the second sample representing those who plead innocent, but
were subsequently found guilty by a trial. Some of these people were a “success” (sent to
prison) and some were categorized as a “failure” (not sent to prison). As we discussed in Case
1A and 1B, the term “success” corresponds to the category of interest. This explains why it may
seem odd to say it is a “success” for someone to be sent to prison. We say this simply because
this is the focus of our investigation. Thus, similar to X₁, X₂, ..., X₁₉₁, we define

$$ Y_j = \begin{cases} 1 & \text{if the } j\text{th person found guilty by trial is sent to prison} \\ 0 & \text{if the } j\text{th person found guilty by trial is NOT sent to prison,} \end{cases} $$

j = 1, ..., 64. Each of the 255 people studied in Case 5A represents a Bernoulli trial. These 255
people are divided into two samples and within each sample we have some trials that are
successful and some that are not.

Ultimately, the decision about whether or not admission of guilt results in a lesser chance of
being sentenced to prison will be driven by the proportion of prison sentences handed out to the
people within our samples. We have seen this theme all throughout the prior case studies.
Namely, statistical inference involves decision making regarding a population of objects by
using relevant data from samples. When working with Bernoulli trials, we tend to focus on
either the total number of successful trials or the proportion of successes. Toward this end,
remember that a parameter is a numerical characteristic of a population. Similar to Case 3A,
we have two populations being compared in Case 5A and so we have two parameters to define.
Similar to Case 1A and 1B, our parameters of interest are population proportions.

The parameters relevant to Case 5A are denoted p1 and p2 :

Let p₁ = the proportion sentenced to state prison among all people accused of robbery in
the San Francisco area with a prior record who plead guilty.

Likewise, let p₂ = the proportion sentenced to state prison among all people accused of
robbery in the San Francisco area with a prior record who were found guilty by trial.

Parameters are generally unknown quantities. They are estimated by statistics which can be
calculated from the information gathered in a sample. In Case 1A, we used the notation p̂ to
represent the sample proportion which estimated the population proportion p. Now, we need to
estimate p1 and p2 by the two sample proportions p̂1 and p̂2 , respectively.

Let p̂1 be the proportion of people in our (first) collected sample that were sent to state prison
among those who admitted their guilt.

Let p̂2 be the proportion of people in our (second) collected sample that were sent to state prison
among those who were found guilty by trial.

If we return to Case Question 5A, we see that we have been given the information necessary to
calculate the values of p̂1 and p̂2 . Among the 191 people who admitted their guilt, 101 were
sent to prison. Hence, p̂₁ = 101/191. Also, among the 64 people who chose to go to trial and
were eventually found guilty, 56 of them were sent to prison. Therefore, p̂₂ = 56/64. Pausing
for just a moment to examine these two sample proportions, we realize that p̂1 is roughly 52.9%
whereas p̂2 is 87.5%.

The difference in the sample proportions is over 34%. As an isolated statement, this appears to
be a large discrepancy. However, the sample proportions are based on only a fraction of the
population. They are based on samples of size 191 and 64, respectively. Is the difference in the
proportions (34%) so large that it could not be explained by chance alone? Is it possible that
other representative samples – if taken – would show little to no discrepancy? Based on the 255
people comprising the samples, we need to determine how likely it is to see such a discrepancy.
This means, we must move toward setting up a hypothesis, obtaining a sampling distribution and
ultimately calculating a p-value. In our context, this process is called statistical inference for
two proportions. Data for this type of problem is often represented in a 2 by 2 contingency
table. The rows in the table represent the two samples collected. The columns in the table
represent the two possible Bernoulli trial categorizations: prison or not. The contingency table
for Case 5A is given below.

                 Prison    Not Sent to Prison    Total
Admit Guilt        101              90             191
Found Guilty        56               8              64
Total              157              98             255

As is done in the table above, contingency tables often include the totals in the margins. When
speaking of a 2 by 2 table, we mean that the body of the table (not the margins) has two rows and
two columns. Looking directly at the contingency table for Case 5A, we can “see” that
p̂₁ = 101/191 and p̂₂ = 56/64. Scan your eyes across the rows to form these two fractions.
There is another key piece of information that can be obtained by using the row totals. Notice
that overall, the fraction of people that were sent to state prison is 157 / 255 . This fraction
represents a pooling of the information in both samples. The careful reader will recall that
pooling was central to Case 3A when we first encountered a two sample investigation. Pooling
will again play a role in the resolution of Case 5A.
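
The three proportions just described can also be read off the table programmatically. The Python sketch below (our own illustration) stores the 2 by 2 body of the table and computes p̂₁, p̂₂ and the pooled proportion from its row and column totals.

import numpy as np

# Rows: Admit Guilt, Found Guilty; Columns: Prison, Not Sent to Prison
table = np.array([[101, 90],
                  [56,   8]])

row_totals = table.sum(axis=1)                  # [191, 64]
p_hat_1 = table[0, 0] / row_totals[0]           # 101/191, about 0.529
p_hat_2 = table[1, 0] / row_totals[1]           # 56/64 = 0.875
p_hat_pooled = table[:, 0].sum() / table.sum()  # 157/255, about 0.616
print(p_hat_1, p_hat_2, p_hat_pooled)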

Hypotheses To be Tested

The key issue in Case 5A is whether or not the admission of guilt tends to result in a lesser
sentence. Namely, does the admission of guilt result in a lesser chance of being sent to prison?
If there is no difference in the prison sentence rate for those that admit guilt when compared to
those that are found guilty by trial, then p₁ = p₂. The null hypothesis is the conjecture of no
change. If there is no change between the population admitting guilt and the population that is
found guilty then the two population proportions will be equivalent. However, if admitting one’s
guilt tends to result in a lesser chance of going to prison, then this would mean that p₁ < p₂.
These are the two competing hypotheses in Case 5A. Stated symbolically, the alternative
hypothesis states that the chance of going to prison if you admit guilt (p₁) is less than (<) the
chance of going to prison if you are found guilty by trial (p₂).

When dealing with two population means (μ₁ and μ₂) in Case 3A, we often stated that
H₀: μ₁ = μ₂. We sometimes wrote this null hypothesis as μ₁ − μ₂ = 0 to emphasize that there
was no (zero!) conjectured change between the two population means. Thinking along this same
line we can state the null and alternative hypotheses for Case 5A as:

H₀: p₁ = p₂   (this is the same as saying p₁ − p₂ = 0)

H_A: p₁ < p₂   (this is the same as saying p₁ − p₂ < 0)

Development of a Test Statistic and the Central Limit Theorem Revisited

When developing a test statistic for the statistical inference problem for a single proportion in
Case 1A, we used the binomial distribution. When transitioning to Case 2A, we began to look at
utilizing averages in test statistics. Creation of an appropriate test statistic in Case 5A begins
with the realization that p̂1 and p̂2 - while sample proportions – are also actually averages.
Sample proportions are averages of zeroes and ones. The 191 outcomes of the random variables
X₁, X₂, ..., X₁₉₁ consist of 101 ones and 90 zeroes. When 101 ones and 90 zeroes are averaged
the result is p̂₁ = 101/191. The same argument can be made for the outcomes of Y₁, Y₂, ..., Y₆₄ and
p̂₂ = 56/64. Thus, similar to Case 3A, the comparison of two sample proportions is actually the
comparison of two special types of sample averages.

When comparing two averages based on data from continuous random variables in Case 3A, the
key statistic to begin with was X̄₁ − X̄₂. Now, the basic statistic forming the comparison will be
p̂₁ − p̂₂. If this statistic is near zero, we have support for the null hypothesis that p₁ − p₂ = 0. If
p̂₁ − p̂₂ is sufficiently negative (as judged by the forthcoming p-value) then we have evidence
for the alternative hypothesis in Case 5A.

A fundamental step in developing the test statistic for Case 3A involved the theorem providing
the distribution of X̄₁ − X̄₂ in the case that we sampled two independent normally distributed
populations. See page 72 for this result. Now, our data are Bernoulli trials - not outcomes from
a continuous and normal random variable. Nonetheless, because of the similarities of Case 5A
and Case 3A (two independent samples dealing with a form of averages), it makes sense to try
and mimic the statistic in Case 3A in some way. The statistic which was ultimately obtained in
the two independent sample problem of Case 3A is

$$ \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}} . $$

Here, for emphasis, the pooled standard deviation has been brought back underneath the radical
and appropriately squared. In Case 3A (see page 75), the value of s_p² was factored out of each
term and then properties of square roots used to write the final arrangement. Just be sure and
realize that the formula above is algebraically equivalent to what appears in Case 3A (page 75).

This cannot be our test statistic for Case 5A since we are using p̂₁ − p̂₂. Additionally, we will
need to pool in a different way than that which created s_p² since our data consists of outcomes on
Bernoulli trials and not normal random variables. But, toward this objective, we have already
discussed a pooled statistic in the context of the two by two contingency table. Recall that the
pooled (overall) fraction of people that were sent to prison is 157 / 255 . This pooled sample
proportion is denoted p̂ = 157/255.

At this point, it has been suggested to use p̂₁ − p̂₂ in the place of X̄₁ − X̄₂. Of course, this would
mean that p₁ − p₂ should replace μ₁ − μ₂. However, recall that we are hypothesizing that
p₁ − p₂ = 0, so similar to what was done in Case 3A, we will have a “0” in this slot of our test
statistic. Lastly, we have suggested that the pooled sample proportion p̂ be incorporated in
some way. If this is done, then our test statistic of Case 5A will be a mirror to that previously
constructed in Case 3A.

The last step in piecing together our test statistic for the hypothesis test regarding two
proportions is to remember that 1) s_p was a pooled standard deviation and 2) obtaining the
standard deviation of an average requires dividing by the square root of the sample size. This
last fact was first seen in Case 2A and 2B. Go back and review the statements made below the
first theorem presented on page 61.

Since we are dealing with Bernoulli trials, the total number of successes among a group of such
trials has a Binomial distribution. The standard deviation of a binomial variable was seen to be
√(np(1 − p)) if the number of trials is n. This can be reviewed by looking back at the statement
on page 23 in the middle of Case 1B. Dividing this quantity by the square root of the number of
trials leaves √(p(1 − p)), the standard deviation attached to a single Bernoulli trial; it is this
per-trial quantity that plays the role of a pooled standard deviation, with the division by the
sample sizes handled inside the test statistic below. In Case 5A, the value of p represents the pooled population
proportion. That is, the fraction of people going to prison when considering all of the criminals
committing robbery in San Francisco. We don’t know this parameter, but clearly the pooled
sample proportion p̂ is an estimate for p and therefore √(p̂(1 − p̂)) is an estimate for √(p(1 − p)).
This last statistic is the appropriate substitute for s_p.

Finally, since √(p̂(1 − p̂)) is the appropriate quantity to use in place of s_p, we can replace s_p² with
p̂(1 − p̂) in the statistic

$$ \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}} . \qquad \text{[Case 3A]} $$

Making all of the changes for Case 5A, we can now see that the test statistic formed for the
hypothesis test regarding two population proportions is

$$ \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n_1} + \dfrac{\hat{p}(1-\hat{p})}{n_2}}} . \qquad \text{[Case 5A]} $$

It is very important to notice how similar the Case 5A test statistic is to the test statistic of Case
3A. Realize that the test statistic of Case 5A uses information from three sample proportions:
the sample proportion from the first sample, the sample proportion from the second sample and
the pooled sample proportion. These can all be easily “seen” in the two by two contingency
table as previously discussed.

Before calculations can be made to resolve Case 5A, we must obtain the sampling distribution
of our new Case 5A test statistic. Recall that the sampling distribution of a test statistic when the
null hypothesis is assumed true is called a null distribution. So, what is the null distribution of
our Case 5A test statistic? The answer to this question could be difficult to pin down for small
samples. But, we know that proportions are special types of averages and the Central Limit
Theorem tells us that averages have approximately normal distributions when the sample sizes
are large (see page 59). Thus, the Central Limit Theorem rescues us from any difficulties
regarding the mathematics of developing the appropriate null distribution. When the sample
sizes n₁ and n₂ are sufficiently large, we can state that

$$ \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n_1} + \dfrac{\hat{p}(1-\hat{p})}{n_2}}} $$

is approximately normally distributed when H₀: p₁ = p₂ is true. This application of the Central
Limit Theorem wraps up all of the concepts behind our statistical inference problem for two
proportions. Our test statistic can now be calculated and Case Question 5A resolved using the
standard normal distribution as our pathway toward an approximate p-value.

Case Question 5A Concluding Decision

Calculations are best done by once again reflecting on the contingency table for the robbery data:

                 Prison    Not Sent to Prison    Total
Admit Guilt        101              90             191
Found Guilty        56               8              64
Total              157              98             255

From this table, we again see that p̂₁ = 101/191, p̂₂ = 56/64 and p̂ = 157/255. Our null
hypothesis is that p₁ − p₂ = 0 and so the calculated value of the test statistic is

$$ \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n_1} + \dfrac{\hat{p}(1-\hat{p})}{n_2}}} = \frac{(101/191 - 56/64) - 0}{\sqrt{\dfrac{(157/255)(98/255)}{191} + \dfrac{(157/255)(98/255)}{64}}} \approx -4.93 . $$

The approximate null distribution of our test statistic is standard normal and because the
alternative hypothesis is left-tailed (H_A: p₁ < p₂), our approximate p-value is P(Z ≤ −4.93).
The z-score of -4.93 is “off the chart” small. The chance that a standard normal random variable
will result in an outcome nearly five standard deviations below average is incredibly low.
Indeed, our p-value is quite rare and certainly below any reasonable significance level that could
have been agreed upon based on analysis of Type I and II errors.

The Excel code “=NORM.DIST(-4.93,0,1,TRUE)” results in the p-value being approximated at


.00000041. Such a rare p-value – even knowing this is approximate – causes us to conclude that
sufficient evidence for the alternative hypothesis exists. The p-value is rare and the original null
hypothesis should be rejected.
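
The full calculation, ending in the same approximate p-value reported by Excel, can be scripted in Python as shown below; scipy's standard normal distribution supplies the left-tail probability, and the counts are those from the contingency table.

import numpy as np
from scipy import stats

n1, x1 = 191, 101    # admitted guilt: sample size and number sent to prison
n2, x2 = 64, 56      # found guilty by trial

p1_hat, p2_hat = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)           # pooled sample proportion 157/255

z = (p1_hat - p2_hat) / np.sqrt(p_pool * (1 - p_pool) / n1
                                + p_pool * (1 - p_pool) / n2)   # about -4.93
p_value = stats.norm.cdf(z)              # left-tailed, about 0.00000041
print(z, p_value)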

Rejecting H₀: p₁ = p₂ and instead concluding that H_A: p₁ < p₂ is most reasonable translates to


the belief that those robbers admitting their guilt tend to have a smaller chance of being
sentenced to prison. There is sufficient evidence in the data from San Francisco to conclude that
one should admit their guilt if they wish to reduce their chance of being sent to California state
prison. We have now concluded that the large (34%) discrepancy in the sample proportions
cannot be reasonably explained by chance alone. This gap between p̂₁ and p̂₂ is statistically
significant. It is so statistically significant that we say with strong conviction that p₁ < p₂ and that
those admitting their guilt stand to have a lesser chance of going to prison.

Finally, as we have stated throughout all of the cases, we should say that our conclusion depends
on our samples being representative of the populations as a whole. Additionally, our conclusion
is confined to the specific area, crime and time frame over which the data were taken. It would
be dangerous to extrapolate or extend our results to other cities, crimes and time periods without
additional data.

However, regarding the year over which our data were collected and with regards to robbery
suspects with prior records in San Francisco, it appears as though some measure of mercy is
given to those who admit their crime and do not proceed with the formalities of a court case where
the evidence is severely mounted against the defendant. The p-value to this effect is roughly
4 in 10 million.

Concepts and Terms to Review from Case Question 5A

Population
Sample
Random variable
Bernoulli trial
Parameter
Statistic
population proportion
sample proportion
statistical inference for two proportions
contingency table
pooling
null hypothesis
alternative hypothesis
sampling distribution
pooled standard deviation
pooled sample proportion
pooled population proportion
standard deviation of a binomial random variable
test statistic for the hypothesis test regarding two population proportions
p-value
null distribution
Central Limit Theorem
Type I error
Type II error
significance level (α)
