Math 321 - Statistics

Introduction: What is Statistics?
Definition: Statistics is the science of

measurement and decision-making under
conditions of uncertainty, randomness,
and variability.
More briefly: Statistics is the field of

dealing with data.
Math 321 - Dr. Minnotte
In statistics, we make observations, to

collect information, to help make
decisions.
If that sounds familiar, it should. We do

that sort of thing every day, in every field
of study, and in our everyday life.
In statistics, we simply formalize this

process mathematically. This allows us to
recognize smaller differences than might
otherwise be found, and to make decisions
under conditions of greater uncertainty.
The term statistic is also used to describe

any bit of numerical information, like the 6.3%
unemployment rate in April, 2014 or the
15,143 students enrolled at UND in Fall, 2013.
These numerical bits of data are thrown at us

every time we read the newspaper, or watch
TV news, or read a journal in our field.
Just as words should be read with

understanding, so should statistics. If we
uncritically accept the numbers others give us,
we open ourselves to believing
misinformation.
Statistics are an important tool in almost

every field. In this class, we'll look at
examples like:
How can doctors tell if a new vaccine really

works?
How can irrigation engineers use past river flow
rates to predict future flows?
How can polltakers use responses from a few
thousand voters to predict the results of an
election in which more than a hundred million
people vote?
What are some other examples of statistics in

practice?
The Challenger Disaster:

A Statistical Cautionary Tale
In 1986, a lack of statistical thinking

contributed to a tragedy: the explosion of the
space shuttle Challenger.
The destruction of the Challenger killed

seven astronauts, including Christa
McAuliffe, a 37-year-old teacher selected to
be the first teacher in space, and set the U.S.
manned space program back several years.
The solid rocket motors used to launch the

space shuttles are shipped to the Kennedy
Space Center in four pieces. Large rubber
O-rings are used to seal the three joints
between the pieces.
The Challenger explosion occurred when one

of the O-rings failed to seal quickly enough to
prevent hot gasses from escaping from the
rocket and igniting the large external fuel
tank.
Implicated in the failure was the unusually

cold (for Florida) launch temperature of 29°F.
The night before the launch, forecasters

predicted a temperature of 31°F for the
launch time.
A three-hour teleconference took place

between people at:
Morton Thiokol (manufacturer of the rocket

motors)
Marshall Space Flight Center (NASA center
for motor design control), and
Kennedy Space Center.
There was concern that the cold

temperatures could lead to problems with
the O-rings.
In 7 out of 23 previous launches, some O-ring damage had occurred.
Some participants recommended delaying

the launch until the temperature rose
above 53°F, the lowest previous launch
temperature, in which the greatest number
of damaged O-rings occurred.
10
In the end, the recommendation was made

to launch on schedule, in part because of
the following plot.
The plot shows temperature vs. number of

damaged O-rings for the 7 affected
launches.
The relationship seems limited, at most.
What error was made preparing this plot?
11
12
13
By only including the launches in which

incidents occurred, the investigators left
out some important information!
When the data from all 23 launches is

plotted, a temperature dependence
becomes obvious.
All 4 launches below 66°F had damage.
Only 3 out of 16 flights above that temperature
suffered damage.
Note where 31°F or 29°F would appear on

that plot.
14
More sophisticated analyses are possible,

but unnecessary.
Had the concerned engineers presented

the complete data in such a format, they
might well have convinced the decision-makers to delay the launch and prevented
the tragedy.
There's more to this story, so we'll return

to it later in the semester.
15
Chapter 1: Univariate Data
Populations and Samples

Definition: A population consists of all
potential observations from a distribution
of interest.
In an enumerative study, the population will

be tangible, real and finite, and might be
represented by a sampling frame listing the
members of the population.
o Examples include populations of people, or
corporations, or items in a shipment.
16
In an analytic study, we study an ongoing

process, and the conceptual population is
infinite and simply a useful theoretical
construct. No sampling frame is possible.
o Examples include populations of rainfall over time,
or objects coming off an ongoing assembly line, or
repeated measurements of the same underlying
weight.
As an investigator, you have a great deal

of flexibility in defining the population of
interest.
17
Example: We are interested in the ages of

UND students. What are some possible
relevant populations?
Example: A quality engineer wishes to

study the volume of milk in containers
coming off a production line. What are
possible populations?
Example: We wish to examine the

incidence of obesity in preteen children.
What is an appropriate population?
18
Once we have defined our population, we

take a sample from that population.
Measurements from each member of the

sample will be the observations which
make up the dataset we will analyze.
Example: Student ages.

19
Experiments
Suppose that a chemical engineer wants

to determine how the concentration of a
catalyst affects the yield of a process.
The engineer can run the process several

times, changing the concentration each
time and compare the yields that result.
This sort of experiment is called a

controlled experiment because the values
of the concentration variable are under the
control of the experimenter.
20
Observational Studies
There are many situations in which scientists

cannot control the variables of interest.
Many studies have been conducted to

determine the effect of cigarette smoking on
the risk of lung cancer. In these studies,
rates of cancer among smokers are
compared with rates among nonsmokers.
The experimenter cannot control who

smokes and who doesn't.
This kind of study is called an observational

study.
21
When we study a sample, we must make

sure it is representative of the population.
One option is a census, or complete

enumeration, of everyone in the
population. What are some problems with
this approach?
22
Usually, the best solution is to take a random

sample, choosing your sample with planned
probability methods.
The most basic such method is called a
simple random sample (SRS).
In a SRS, we draw individuals out of the
population with the equivalent of drawing
names out of a (well-mixed) hat.
Each subset of the population of the
appropriate size is equally likely to make up
the sample.
This is theoretically convenient, but often
hard to arrange in practice.
23
When viewed in order, or over time, the

observations of a SRS should not show
any noticeable pattern or trend.
24
A SRS is not guaranteed to reflect the

population perfectly.
SRSs always differ in some ways from

each other; occasionally a sample is
substantially different from the population.
This phenomenon is known as sampling

variation.
25
The items in a sample are independent if

knowing the values of some of the items
does not help to predict the values of the
others.
Items in a simple random sample may be

treated as independent in most cases
encountered in practice. The exception
occurs when the population is finite and
the sample comprises a large fraction
(more than 5%) of the population.
26
Samples of Convenience
A nonrandom sample, or sample of

convenience, may be easier to collect, but
may be nonrepresentative in some
important ways.
Such a sample may bias your results,

making them worthless (or at least a whole
lot less trustworthy).
27
Example: We are interested in the size of

hometowns for all U.S. college students,
but only sample at UND.
Example: We want to survey UND

students on math anxiety, and pick a class
to interview:
Math 321?
Upper-division English?
28
Example: Not everyone will consent to test

a new AIDS vaccine. We could give those
who consent the vaccine, and leave those
who don't alone to be the control group.
What about a historical control (compare

vaccinated group with past infection
rates)?
29
Terminology and Notation
From each individual person or object in

our sample, we are generally interested
only in a small number of characteristics.
Each characteristic we record will be

called a variable, and assigned a letter
from the end of the alphabet.
30
Data that we collect may be of two main

types:
1) Categorical: classifying the subject into one
of several distinct groups.
o X = Sex
o T = Hair Color
o W = Zip Code
2) Numerical: data recorded as a number,
where operations like averages make sense.
o Y = Age
o U = Rainfall
o Z = Volume of milk
31
We also classify datasets based on how

many variables we measure on each
individual.
If we only collect a single variable (e.g. age),

we say the dataset is univariate.
If we collect two variables for each individual

(e.g. age and sex), we say it is bivariate.
With still more variables, we say that it is

trivariate, quadrivariate, and so on, or more
commonly, that it is multivariate.
32
We often use subscripts on the variable

name (letter) to indicate specific
observations in a dataset, such as $X_1, X_2, \ldots, X_n$.
A subscript of i (occasionally j or k)
indicates a specific, but arbitrary,
observation.
We usually reserve the label n for the

number of observations (the sample size).
33
There are two primary branches of

statistics:
1) Descriptive statistics simply attempts to
simplify and understand a dataset.
2) Inferential statistics attempts to say (infer)
something about the broader population or
distribution from which the data was
drawn.
Descriptive statistics are simpler, so we'll
start there.
34
Summary Statistics (1.2)
Given data $X_1, X_2, \ldots, X_n$, we frequently use

sample statistics to summarize the dataset.
A statistic is anything which may be

calculated from a dataset. A sample statistic
simply makes clear that it derives from a
sample.
Use of sample statistics can improve our

understanding of the data, as well as make it
easier to communicate with others about it.
35
The Sample Mean
The most important feature of a dataset to

describe is generally its location, or the
location of its center.
The most commonly used statistic for center

is the familiar average, or sample mean.
Definition: The sample mean of data $X_1, X_2, \ldots, X_n$ is
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
36
Example: Stocks:
37
To understand how the mean works,

suppose we were to take a very thin
yardstick or similarly marked board, and
place a small (equal) weight at the mark
for each observation's value.
The mean may be thought of as the point

where this would balance.
38
Outliers
An outlier is an observation which is very

different from the rest of the sample. For
univariate data, this means it is much larger
or much smaller than the rest.
Outliers should be carefully examined. Often

they are the result of measurement or
recording errors.
If so, they should be fixed or deleted. Correct

but unusual values, however, should be kept.
39
The sample mean is not robust (resistant

to outliers). Changing even one
observation can change the sample mean
as much as we want.
Example: Mistype the final stock return as

374 (instead of 37.4). What is the sample
mean now?
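A minimal Python sketch of why one bad value moves the mean so far. It assumes n = 20 yearly returns (1976-1995) and the rounded mean of 15.37 reported in the Minitab summary for the stock data; the helper name `mean_after_change` is made up for illustration.

```python
def mean_after_change(old_mean, n, old_value, new_value):
    """Return the sample mean after replacing one observation.

    Replacing old_value by new_value changes the sum by
    (new_value - old_value), so the mean shifts by that amount divided by n.
    """
    return old_mean + (new_value - old_value) / n


# Assumed inputs: n = 20 returns with (rounded) mean 15.37.
# The typo turns the largest return, 37.4, into 374.
shifted = mean_after_change(old_mean=15.37, n=20, old_value=37.4, new_value=374)
print(f"mean with the typo: {shifted:.2f}")
```

A single keystroke error roughly doubles the mean here, which is what "not robust" means in practice.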
40
Measures of Variability
After center, the second-most-used

feature to describe a sample is its
variability, or spread.
41
The simplest measure of variability is the

range, the difference between the
maximum and minimum values.
R = max(X) - min(X)
Unfortunately, the range both wastes most

of the data, and is maximally non-robust,
using only the two extreme data points, so
it is rarely used.
42
A better solution looks at the deviations
from the mean, $X_i - \bar{X}$. This removes the
effect of the mean (location), and looks
only at the variability around the mean.
One option: Look at the average deviation

from the mean.
Problem: Positive deviations cancel out

negative ones, and the average deviation
from the mean is always 0.
43
We could take absolute values of the

deviations, but for a few theoretical
reasons, it's better to look at the squared
deviations instead.
Definition: The sample variance, $s^2$,
measures the spread of a dataset:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$$
Definition: The sample standard deviation,
s, is the square root of the sample
variance: $s = \sqrt{s^2}$.
44
Use of the definition formula is tedious, as

it requires finding and squaring each of the
n deviations from the mean.
It is usually simpler to calculate $s^2$ using
the following computation formula:
$$s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right)$$
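The two routes to the sample variance can be compared in a short Python sketch. It assumes the usual forms, the definition $s^2 = \sum (X_i - \bar{X})^2/(n-1)$ and the computing formula $s^2 = (\sum X_i^2 - n\bar{X}^2)/(n-1)$; the data are hypothetical, not the stock returns.

```python
import math


def variance_definition(xs):
    """Sample variance from the squared deviations about the mean."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def variance_computing(xs):
    """Sample variance from the sum of squares, without the n deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean ** 2) / (n - 1)


data = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]  # hypothetical observations
s2 = variance_definition(data)
s = math.sqrt(s2)
print(s2, variance_computing(data), s)
```

Both functions return the same number up to rounding; the computing form only needs the running totals of X and X squared.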
45
Example: What are the variance and

standard deviation of the stocks data?
46
The sample variance and standard

deviation are measures of the spread of a
dataset, and estimates of the variance and
standard deviation of the underlying
population or distribution.
Like the sample mean, they are not robust.

Example: Stocks, replace 37.4 with 374:
$s^2$ = ?
s = ?
While very useful practically and

theoretically, the variance and standard
deviation are a little tricky intuitively.
One helpful rule of thumb:
47
About 2/3 of data should fall in $\bar{X} \pm s$.
About 95% of data should fall in $\bar{X} \pm 2s$.
Almost all data should fall in $\bar{X} \pm 3s$.
Example: Stock data:
48
If $X_1, \ldots, X_n$ is a sample, and $Y_i = a + bX_i$,
where a and b are constants, then
$$\bar{Y} = a + b\bar{X}, \qquad s_Y^2 = b^2 s_X^2, \qquad s_Y = |b|\, s_X$$
This is most commonly needed if we

change units for our data.
49
Example: Let $X_1, \ldots, X_n$ be a sample of
temperatures measured in degrees
Celsius, with $\bar{X} = 30$. Let $Y_1, \ldots, Y_n$ be the
same temperatures in degrees Fahrenheit,
$Y_i = \frac{9}{5} X_i + 32$. What is $\bar{Y}$?
Example: Let the variance of the Celsius
temperatures be $s_X^2 = 25$.
What is the standard deviation?

What is the variance of the Fahrenheit
temperatures? The s.d.?
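A small Python check of the change-of-units rules, assuming the standard results that for $Y_i = a + bX_i$ the mean becomes $a + b\bar{X}$ and the standard deviation becomes $|b| s_X$. The Celsius readings are hypothetical; the last lines apply the same rules to the summary values given in the example.

```python
import statistics

A, B = 32.0, 9 / 5  # Fahrenheit = A + B * Celsius

celsius = [25.0, 27.5, 30.0, 32.5, 35.0]  # hypothetical sample with mean 30
fahrenheit = [A + B * c for c in celsius]

mean_c, sd_c = statistics.mean(celsius), statistics.stdev(celsius)
mean_f, sd_f = statistics.mean(fahrenheit), statistics.stdev(fahrenheit)
print(mean_f, A + B * mean_c)  # the two means agree
print(sd_f, abs(B) * sd_c)     # the two standard deviations agree

# Summary values from the example: mean 30 and variance 25 in Celsius.
y_bar = A + B * 30
var_y = B ** 2 * 25
sd_y = abs(B) * 25 ** 0.5
print(y_bar, var_y, sd_y)
```

Adding the constant 32 shifts the center only; the spread is scaled by 9/5 and the variance by (9/5) squared.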
50
Order Statistics and Robust

Measures of Center and Spread
Definition: The ith order statistic, $X_{(i)}$, is the
ith smallest value when the X's are sorted.
The minimum is $X_{(1)}$, the second smallest
$X_{(2)}$, and so on up to the maximum, $X_{(n)}$.
51
Example: Stock data (sorted):
$X_{(1)} = -7.2$, $X_{(4)} = 1.3$, $X_{(20)} = 37.4$, and so on.
Because outliers will always be in the first

or last few order statistics, values
computed from middle order statistics will
be very robust.
52
Definition: The sample median, $\tilde{X}$, is the
middle of the sorted data.
If n is odd, the sample median is the (n+1)/2th
order statistic.
If n is even, it is the average of the n/2th and
(n+2)/2th order statistics.
Example: Stocks: $\tilde{X}$ = ?
53
The sample median has 50% of the data

on either side of it.
The sample median is very robust;

changing one or a few observations won't
change it much, if at all.
Example: Stocks: Replace 37.4 with 374,

and the sample median remains 17.6
54
Quartiles
The quartiles of the data divide the sample

into quarters.
The first quartile, Q1, splits the lowest quarter

of the sample from the rest.
If (n+1)/4 is an integer, Q1 is the (n+1)/4 order

statistic.
If (n+1)/4 is not an integer, Q1 is the average of
the two order statistics on either side.
The third quartile, Q3, splits the highest

quarter from the rest.
Find it as Q1, but using 3(n+1)/4.

55
Example: Sorted stocks:
Q1 = ?
Q3 = ?
Definition: The sample interquartile range

is a robust measure of spread, found as
the difference between the sample
quartiles, IQR = Q3 - Q1.
Example: Stocks: IQR = ?
Note: Changing 37.4 to 374 doesn't

change Q1, Q3, or IQR.
56
57
Percentiles
Definition: The pth sample percentile has

(roughly) p% of the data below it, and
(100-p)% above it.
Compute p(n + 1)/100. If this is an

integer, use that order statistic. If not,
average the two closest order statistics.
The median and quartiles are just special

names for the 50th, 25th, and 75th
percentiles.
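The percentile rule above can be sketched in Python as follows. The function name `sample_percentile` and the test data are illustrative assumptions; statistical packages often interpolate instead of averaging, so their quartiles may differ slightly from this rule.

```python
from fractions import Fraction


def sample_percentile(data, p):
    """pth sample percentile using the p(n + 1)/100 position rule."""
    xs = sorted(data)
    n = len(xs)
    position = Fraction(p) * (n + 1) / 100  # 1-based position in sorted data
    if position < 1 or position > n:
        raise ValueError("p(n+1)/100 falls outside the order statistics")
    lower = position.numerator // position.denominator  # floor of position
    if position.denominator == 1:
        # Integer position: use that order statistic directly.
        return xs[lower - 1]
    # Otherwise average the order statistics on either side.
    return (xs[lower - 1] + xs[lower]) / 2


values = list(range(1, 21))  # hypothetical sample, n = 20
q1 = sample_percentile(values, 25)
median = sample_percentile(values, 50)
q3 = sample_percentile(values, 75)
print(q1, median, q3, q3 - q1)
```

With n = 20 the positions are 5.25, 10.5 and 15.75, so each result is the average of two neighboring order statistics.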
58
Example: Descriptive Statistics in Minitab
Descriptive Statistics: Stock Returns 1976-1995

Variable          Mean   StDev  Variance  Minimum    Q1  Median     Q3  Maximum    IQR
Stock Returns 19  15.37  13.66    186.49    -7.20  5.48   17.60  28.90    37.40  23.43
59
Basic Statistical Graphics (1.3)
Some of the most powerful tools available

for understanding a dataset are graphics
which we can use to look at our data.
It's very hard to get much useful out of

large tables or long columns of numbers.
But the human eye is very good at picking
out patterns in pictures.
60
Bar Charts
Given categorical data, the most useful

plot available is usually a simple bar chart.
A bar is drawn for each category, with the

height proportional to the count
(frequency) or percentage found in that
category.
Other measurements for each category

may also be compared.
61
Example: Television Picture Grades
Perfect, Good, Satisfactory, Fail
Category       Count
Perfect           64
Good              47
Satisfactory      33
Fail               6
Total            150
62
63
Spaces between the bars show

categories.
Bars should start at 0 and show full height

(no truncation!). Otherwise, relative
heights get distorted.
64
65
Unless there is a strong natural ordering

(e.g. poor-fair-good-excellent; not
alphabetical), bars should be sorted in
ascending or descending order. This
makes comparisons between close values
much easier.
66
67
Many categories or long category names

may be better served by horizontal bars.
68
69
3-D perspective looks fancy but hurts

clarity; usually a bad idea.
70
A stacked bar chart includes a second

categorical variable, but focuses on the
totals for the main category of the bars.
[Figure: stacked bar chart, "Individuals on the Titanic": counts (0 to 1000) of Survived and Died for 1st Class, 2nd Class, 3rd Class, and Crew]
A clustered bar chart focuses on the

counts of the specific combinations of
categories, and is useful for comparing the
distribution of one variable for different
values of the other.
[Figure: clustered bar chart of Titanic counts (0 to 800), Died and Survived, for 1st Class, 2nd Class, 3rd Class, and Crew]
71
72
Example Minitab Bar Charts
73
74
75
Pie Charts
The other common chart for categorical

data.
A pie chart should only be used when the

categories represent (all of the) parts of
some whole, and so should always plot
percentages.
76
Each category's slice gets an angle
equal to 360° times its relative frequency.
Comparing angles is much more difficult

than comparing heights or lengths. Bar
charts are almost always more effective.
3-D pie charts are the work of the devil.

(Probably worse than no chart.)
77
78
79
Minitab:
Dotplots
Dotplots are simple plots which are very

useful for looking at univariate numeric
data, especially when the sample size is
small or there are many ties in the data.
Each observation is plotted at its location

above an appropriate number line. If there
are ties, one dot is stacked for each tied
observation.
80
Example: Temperature (°F) at launch of
the first 25 space shuttle launches.
66 70 69 80 68
67 72 73 70 57
63 78 70 67 53
75 67 70 81 76
79 75 76 58 31
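A rough text-mode dotplot of these 25 launch temperatures, as a Python sketch. The number line runs down the page rather than across it, and the function name `text_dotplot` is an illustrative choice.

```python
from collections import Counter

# Launch temperatures (deg F) as listed in the example above.
temps = [66, 70, 69, 80, 68, 67, 72, 73, 70, 57, 63, 78, 70,
         67, 53, 75, 67, 70, 81, 76, 79, 75, 76, 58, 31]


def text_dotplot(values):
    """One row per integer on the number line, one '*' per observation."""
    counts = Counter(values)
    rows = []
    for v in range(min(values), max(values) + 1):
        rows.append(f"{v:3d} | " + "*" * counts[v])  # ties stack as extra stars
    return "\n".join(rows)


print(text_dotplot(temps))
```

The long empty stretch between 31 and 53 is immediately visible, which is the point of plotting the data instead of reading the table.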
81
Histograms
A histogram is a bar chart for numerical

data.
The shape of the histogram describes the

shape of the distribution of the data.
If you have a large, randomly collected

sample, the shape is also descriptive of
the population the sample was taken from.
Your book also describes stem-and-leaf

plots, which are similar, but rarely used.
82
Constructing a Histogram
1) Find the minimum and maximum of the
data.
2) Break that interval into class intervals.
o 5-20 classes is often a good start. More for
large samples, fewer for small ones.
o A reasonable rule of thumb is
o Select your classes so that each is of equal
width.
3) Find the frequencies (counts, $n_i$) and
relative frequencies ($f_i = n_i/n$) in each
class.
4) Plot the bar chart with a bar over each
class whose height equals $f_i$ or $n_i$.
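Steps 1 to 3 can be sketched in Python. This assumes equal-width classes that include their left endpoint and exclude their right endpoint; the starting point 30, the width 10, and the use of the shuttle launch temperatures are choices made for illustration.

```python
# Shuttle launch temperatures (deg F) from the dotplot example.
temps = [66, 70, 69, 80, 68, 67, 72, 73, 70, 57, 63, 78, 70,
         67, 53, 75, 67, 70, 81, 76, 79, 75, 76, 58, 31]


def class_frequencies(values, start, width, n_classes):
    """Counts n_i and relative frequencies f_i = n_i / n per class.

    Class k covers [start + k*width, start + (k+1)*width).
    """
    counts = [0] * n_classes
    for v in values:
        k = int((v - start) // width)
        if not 0 <= k < n_classes:
            raise ValueError(f"{v} falls outside the class intervals")
        counts[k] += 1
    n = len(values)
    return counts, [c / n for c in counts]


counts, rel_freqs = class_frequencies(temps, start=30, width=10, n_classes=6)
for k, (c, f) in enumerate(zip(counts, rel_freqs)):
    print(f"[{30 + 10 * k}, {40 + 10 * k}): n_i = {c:2d}, f_i = {f:.2f}")
```

Rerunning with a different `start` or `width` is a quick way to see how much the picture depends on the choice of bins.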
83
84
Example: Stock Data (Annual Rate of Return,

1976-1995):
85
The shape of the histogram tells us about

the distribution. Some things to look for
include:
Is the distribution left-skewed?

Symmetric?
Right-skewed?
Is the distribution bimodal?
Multimodal?
Are there any outliers?

86
87
It's a good idea to look at several choices

of bin width and location, as different
choices here can produce dramatically
different histograms.
Features that remain in many histograms

are likely to be trustworthy; those that only
appear sometimes are less certain.
88
Example: Milk Fill Weights Data
89
90
91
92
93
Boxplots
Definition: A boxplot is another graphical

tool for displaying a sample:
94
The box goes from the first to the third

quartile, with a line at the median.
For boxplots, outliers are usually defined
as any values below
Q1 - 1.5 × IQR
or above
Q3 + 1.5 × IQR.
Those points are marked individually.
The whiskers go from the quartiles to the
least and greatest values among the non-outliers.
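A short Python sketch of the boxplot outlier rule, using the quartiles Minitab reports for the stock returns (Q1 = 5.48, Q3 = 28.90). The function names are illustrative.

```python
def outlier_fences(q1, q3):
    """Return (lower, upper) fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


low, high = outlier_fences(5.48, 28.90)


def is_outlier(x):
    """True if x would be plotted as an individual point."""
    return x < low or x > high


print(f"fences: {low:.2f} to {high:.2f}")
print(is_outlier(37.4), is_outlier(374))
```

The true maximum of 37.4 sits inside the fences, while the mistyped 374 would be flagged at once.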
95
Boxplots are much less informative than

histograms for a single distribution, so the
histogram is usually preferable.
On the other hand, comparing histograms

is difficult, while comparing boxplots is
easy.
Use boxplots to compare 2-20 (or more)

distributions.
96
Example: Fish length data
97
Example: Circuit board data by board.
98
Ch. 2: Bivariate Data
Statistics is most powerful when looking at

relationships between variables.
In the simplest case, this involves looking

at pairs of measurements made on the
same subjects, (x, y).
Recall, such data is called bivariate (two

variables).
99
Examples:
Heights and weights of a group of people.

ACT score and Freshman GPA for college
students.
January and April average temperatures for
many years at a specified location.
January and February inflows of the Nile river
at a location.
100
We usually picture our variables in a

cause-and-effect relationship.
The explanatory (independent, predictor)

variable, x, is assumed to play some role
in determining the value of the response
(dependent) variable, y.
x → y
101
Scatterplots (2.1)
Definition: A scatterplot is the most

common graph for displaying bivariate
data. It consists of plotting each point at
(xi, yi), on a standard x-y graph.
The pattern formed by the points

describes the relationship between the
variables.
102
103
104
105
106
Minitab Scatterplot:
107
Correlation
Suppose we have a sample of (x, y) pairs
and compute the sample means, $\bar{x}$ and $\bar{y}$.
For each observation ($x_i$, $y_i$), compute the
product of the two deviations from the
means, $(x_i - \bar{x})(y_i - \bar{y})$.
Dividing the scatterplot at the means

results in two quadrants where the product
is positive, and two where it is negative.
108
109
For a scatterplot with a positive

relationship, most of the products will have
a positive sign, and the sum will be
positive.
Likewise, if the picture shows a negative

relationship, the sum of the products will
be negative.
Unfortunately, the exact value of the sum

depends on the units and spread (as
measured by standard deviation) of the
variables.
110
Dividing by measures of spread for x and y
solves this issue.
Then
$$r = \frac{1}{n-1}\sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}$$
is a good, unitless
measure of the linear relationship between x
and y called the correlation coefficient.
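A minimal Python sketch of the correlation coefficient, assuming the usual form: the sum of the products of deviations, divided by $(n-1) s_x s_y$. The five (x, y) pairs are hypothetical, not the Nile data.

```python
import math


def correlation(xs, ys):
    """Sample correlation r between paired observations xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return products / ((n - 1) * s_x * s_y)


xs = [1, 2, 3, 4, 5]  # hypothetical data
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
r_rescaled = correlation([10 * x + 3 for x in xs], ys)  # change of units for x
print(round(r, 4), round(r_rescaled, 4))
```

Rescaling x leaves r unchanged, which is the "unitless" property in action.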
Example: Nile flow data: n=115
What is r?
111
112
Properties of r
1. The value of r does not depend on the units of x
or y. We will not change r if we multiply all x's,
all y's, or both by a positive constant or if we add
any constant to all x's, all y's, or both.
2. The value of r does not depend on which
variable is labeled x.
3. Correlation is always between -1 and +1.
4. The sign of r shows whether the relationship
between x and y is positive or negative.
113
Properties of r (continued)
5. The absolute value of r measures the strength of the
linear relationship between x and y. Roughly
speaking:
a. If |r| < 0.5, the relationship (if any) is weak.
b. If 0.5 < |r| < 0.8, the association is moderate.
c. If 0.8 < |r| < 1.0, the association is strong.
d. If |r| = 1.0, the association is perfect. This occurs only
when all (x, y) points fall in a perfect line.
Note that strength is often context- and discipline-dependent. An engineer might find any correlation less
than .95 to be weak, while a social scientist might find a
correlation of .3 to be very strong.
114
115
116
Properties of r (continued)
6. The correlation coefficient cannot measure the
strength of a nonlinear (curved) relationship.
117
7. Outliers can also lead to an inappropriate value in either direction!
118
High correlation indicates strong

association, not necessarily causality.
If |r| is large, there are at least 3 possible

explanations:
1) x determines y
2) y determines x
3) Some third value, z, (called a confounding
factor) determines both x and y.
119
Example: Weekly surveys show that per

capita chocolate consumption is strongly
correlated with traffic fatalities.
Should driving under the influence of

chocolate be outlawed?
Do people eat a lot of chocolate at funerals?
Is there a third explanation that makes more
sense?
120
Example: Over time, ministers' salaries in

Massachusetts are strongly correlated with
the price of rum in Havana. What is the
causal relationship here?
Example: Children's shoe size is

correlated with size of vocabulary. What is
the causal relationship?
121
One advantage of well-designed

randomized, controlled experiments is that
potential confounding factors should be
(roughly) balanced between levels of the
independent variable we are investigating,
so should be much less likely to produce a
spurious correlation.
122
Linear Regression (2.2-2.3)
Definition: Regression involves modeling

and predicting the values of one response
variable, based on the observed values of
one or more other explanatory variables.
We'll focus on the case of simple linear

regression, where a straight line is fit to a
scatterplot of x and y.
123
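We want an equation for a line of the form
$$y = \beta_0 + \beta_1 x$$
The most common way to estimate $\beta_0$ and
$\beta_1$ uses the least squares fit, minimizing
$$\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$$
This leads to the least squares estimates,
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$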
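A minimal Python sketch of the least squares fit, assuming the standard estimates: slope = $\sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$ and intercept = $\bar{y}$ minus slope times $\bar{x}$. The data and the function name `least_squares` are hypothetical.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line for y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


xs = [1, 2, 3, 4, 5]  # hypothetical data
ys = [2, 4, 5, 4, 5]
intercept, slope = least_squares(xs, ys)
print(f"fitted line: y = {intercept:.1f} + {slope:.1f} x")
```

The fitted line always passes through the point of means, since the intercept is defined from $\bar{x}$ and $\bar{y}$.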
124
Deviations from a potential regression line:
125
The least squares line best fits the scatter plot.
126
Example: Nile flow data
What is the least-squares line for this data,

and what should we predict the flow for
February to be if January's was 3?
127
128
What would we predict for February from a

January value of 10?
Is this likely to be a valid prediction?

(Recall, January's mean is about 4, and its
standard deviation is about 1.)
Extrapolation outside the range of the data

is dangerous.
129
Residuals and Goodness-of-Fit
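Definition: Given a data set ($x_i$, $y_i$) and an
associated fitted regression model, the
fitted value for observation i is
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$
where $\hat{\beta}_0$ and $\hat{\beta}_1$ are the fitted intercept and slope.
Definition: The residual for i is
$$e_i = y_i - \hat{y}_i$$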
The smaller the residuals, the better x and

the regression line are at predicting y.
130
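The error sum of squares (SSE) is the sum of the squared residuals,
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
SSE is usually compared to the total sum of squares, SST:
$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
and the regression sum of squares, SSR:
$$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$
To avoid having to calculate all the residuals, we may use the
computing formula:
SSE = SST - SSR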

131
132
The coefficient of determination, $r^2$,
measures the proportion of the total
variation of y which is explained by x:
$$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
The closer $r^2$ is to 1, the more successful

the relationship is at explaining the
variation in y.
As the notation suggests, the coefficient of

determination is the square of the
correlation coefficient.
133
Example: Nile flow data:
Find SST, SSR, SSE, and $r^2$.
What do these say about our predictions?
Note: r = 0.933.
134
The coefficient of determination $r^2$ is found as R-Sq
in Minitab output.
The sums of squares may be found in the SS
column of the Analysis of Variance table.
The regression equation is
February Inflow = - 0.4698 + 0.8362 January Inflow
S = 0.330519   R-Sq = 87.1%   R-Sq(adj) = 87.0%
Analysis of Variance
Source       DF       SS       MS       F      P
Regression    1  83.3794  83.3794  763.25  0.000
Error       113  12.3444   0.1092
Total       114  95.7238
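The Minitab numbers can be checked with a few lines of Python. The coefficients and sums of squares are copied from the output above; the function name `predict_february` is illustrative, and the January value of 10 is included only to show what extrapolation produces, not as a trustworthy prediction.

```python
INTERCEPT = -0.4698  # from the Minitab regression equation
SLOPE = 0.8362
SSR, SSE, SST = 83.3794, 12.3444, 95.7238  # SS column of the ANOVA table


def predict_february(january_inflow):
    """Fitted February inflow for a given January inflow."""
    return INTERCEPT + SLOPE * january_inflow


r_squared = SSR / SST
print(f"January 3  -> February {predict_february(3):.4f}")
print(f"January 10 -> February {predict_february(10):.4f} (extrapolation)")
print(f"SST - SSR = {SST - SSR:.4f}, r^2 = {r_squared:.3f}")
```

The ratio SSR/SST reproduces the reported R-Sq of 87.1%, and its square root agrees with r = 0.933.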
135
Chapter 3: Probability
Definition: Probability is the branch of

mathematics dealing with chance,
randomness, and uncertainty.
Probability provides most of the

mathematical foundation for inferential
statistics.
136
Definition: A situation for which the

outcome cannot be determined in advance
is called an experiment.
Examples:
The roll of a die.

The draw of a card.
The lifetime of an electronic component.
Definition: The sample space, S, of an

experiment is the set of all possible
outcomes.
Examples:
137
Die: S = {1, 2, 3, 4, 5, 6}
Card: S = ?
Component: S = ?
An experiment with several steps can be

visually represented by a tree diagram:
Example: Toss a coin three times:
138
139
Events
Definition: Set A is a subset of set B
(A ⊂ B) if every element of A is also in B.
Example: S = {1, 2, 3, 4, 5, 6}
A = {1, 3, 5} ⊂ S
B = {1, 2, 6, 7} ⊄ S
Every set is a subset of itself.
The empty set, ∅, consisting of no
elements, is a subset of every set.
Definition: Any interesting subset of the

sample space can be called an event.
Examples:
140
Die: A = odd numbers = {1, 3, 5}

Card: B = ?
Component: C = ?
The individual outcomes which make up S

are sometimes called simple events.
141
Combining Events
For subsets of S, A and B (A ⊂ S, B ⊂ S):
1) The union of A and B (A ∪ B) is the set
consisting of all elements found in A, B, or
both.
Keyword: or
Example: S = {1, 2, 3, 4, 5, 6}
A = {1, 3, 5} ⊂ S
B = {1, 2, 3} ⊂ S
A ∪ B = ?
142
2) The intersection of A and B (A ∩ B) is the
set consisting of all elements found in both
A and B.
Keywords: and, both
Example: S = {1, 2, 3, 4, 5, 6}
A = {1, 3, 5}
B = {1, 2, 3}
A ∩ B = ?
143
3) The complement of A ($A^c$) is the set
consisting of all elements of S not found in
A.
Keyword: not
Example: S = {1, 2, 3, 4, 5, 6}
A = {1, 3, 5}
$A^c$ = ?
144
4) Sets A and B are said to be mutually
exclusive if there are no elements in both
A and B. That is, if A ∩ B = ∅ (the empty
set).
Example: S = {1, 2, 3, 4, 5, 6}
A = {1, 3, 5}
C = {4, 6}
A ∩ C = ∅, so A and C are mutually
exclusive.
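These set operations map directly onto Python's built-in `set` type. The sketch below uses the die example's sets; the variable names are illustrative.

```python
S = {1, 2, 3, 4, 5, 6}  # sample space for one roll of a die
A = {1, 3, 5}
B = {1, 2, 3}
C = {4, 6}

union = A | B          # elements in A, B, or both
intersection = A & B   # elements in both A and B
complement_A = S - A   # elements of S not in A

print(sorted(union), sorted(intersection), sorted(complement_A))
print(A <= S)             # True: A is a subset of S
print({1, 2, 6, 7} <= S)  # False: 7 is not in S
print(A.isdisjoint(C))    # True: A and C are mutually exclusive
```

Note that the complement only makes sense relative to a stated sample space, which is why it is written as a difference from S.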
145
Example: Three coin tosses.
S={HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
Let A = First toss is a head = ?
Let B = Last toss is a head = ?
What simple events make up the event A and

B?
A or B?
Not A?
Are A and B mutually exclusive?

146
The Axioms of Probability

Definition: A probability function P(·) is a
function from subsets of S (events) to the
real numbers which satisfies the following
axioms of probability:
1) P(S) = 1.
2) 0 ≤ P(A) ≤ 1 for all events A.
3) If A and B are mutually exclusive,
P(A ∪ B) = P(A) + P(B).
147
Example: A fair die.
P(1) = 1/6, P(2) = 1/6, P(3) = 1/6, P(4) = 1/6,

P(5) = 1/6, P(6) = 1/6.
Probabilities of bigger events are found by

axiom 3:
P({1,3}) = P(1) + P(3) = 1/6 + 1/6 = 2/6 = 1/3

P({1,3,5}) = ?
148
Example: A biased die.
P(1) = 1/12, P(2) = 1/6, P(3) = 1/6, P(4) = 1/6,

P(5) = 1/6, P(6) = 3/12 = 1/4.
Note:
(as required by axiom 2)
P({1,3}) = P(1) + P(3) = 1/12 + 1/6 = 1/4

P({1,3,5}) = ?
149
When applied to real experiments,

probability measures (long-term)
likelihood: if the experiment is repeated
many times, event A should occur roughly
P(A) fraction of the time.
150
Additional Properties of Probability
The axioms of probability imply some

additional properties:
1) For any event A, P($A^c$) = 1 - P(A).
This is sometimes called the complementary

events rule, or the opposites rule.
Show:
Note: Since $S^c$ = ∅, P(∅) = 0.
151
2) For any events A and B,
P(A ∪ B) = P(A) + P(B) - P(A ∩ B).
This is sometimes called the general addition
rule.
Show:
Note: if A and B are mutually exclusive,
P(A ∩ B) = P(∅) = 0, so this is the same
as axiom 3.
152
Example: A fair die.
P(1) = 1/6, P(2) = 1/6, P(3) = 1/6, P(4) = 1/6,

P(5) = 1/6, P(6) = 1/6.
A = {1, 3, 5},   P(A) = 3/6 = 1/2.
B = {1, 2},   P(B) = 2/6 = 1/3.
P($A^c$) = ?
A ∩ B = {1},   P(A ∩ B) = 1/6.
P(A ∪ B) = ?
We don't need to know the entire

probability function to use these.
Example: Lifetime of a component (T).

Suppose we know:
153
P(A) = P(T ≤ 60) = .47
P(B) = P(40 ≤ T ≤ 80) = .34
P(A ∩ B) = P(40 ≤ T ≤ 60) = .26
Then:
P(T > 60) = ?
P(lifetime no more than 80) = ?
154
Example: Suppose the probability that an

integrated circuit chip has defective
etching is 0.12. The probability that the
chip has a crack defect is 0.29. And the
probability of both defects is 0.07.
What is the probability the chip does not

have defective etching?
What is the probability it has at least one

defect?
What is the probability it has neither

defect?
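The three chip questions follow directly from the complementary events rule and the general addition rule. A minimal sketch, using exact fractions; the variable names are illustrative only.

```python
from fractions import Fraction

p_etch = Fraction(12, 100)   # P(defective etching)
p_crack = Fraction(29, 100)  # P(crack defect)
p_both = Fraction(7, 100)    # P(both defects)

p_no_etch = 1 - p_etch                      # complementary events rule
p_at_least_one = p_etch + p_crack - p_both  # general addition rule
p_neither = 1 - p_at_least_one              # complement of "at least one"

print(float(p_no_etch), float(p_at_least_one), float(p_neither))
```

Subtracting P(both) in the addition rule avoids counting chips with both defects twice.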
155
Equally Likely Outcomes
If S consists of N equally likely outcomes,

and event A consists of k of them,
P(A) = k/N.
Example: A fair die (see slides 148, 153).
Example: Draw a card at random from a

standard deck (52 cards, 13 spades). What
is the probability of drawing a spade?
Example: A shipment of 1000 hard drives

contains 6 which do not work. If we draw one
at random, what is the probability of selecting
a defective drive?
156
Conditional Probability (3.2)
Suppose we have partial information about

the outcome of an experiment. In
particular, suppose we know that the event
B has occurred.
We may use this information to revise the

probability of another event, A.
We call the revised probability a

conditional probability, as it depends on
the condition of B being true.
157
Example: Fair die. Let
A = {1, 3, 5}
P(A) = 3/6 = 1/2
B = {1, 2, 3}
P(B) = 3/6 = 1/2
P(A ∩ B) = P({1, 3}) = 2/6 = 1/3
If I roll the die and, without showing you, tell
you event B has occurred (I rolled no greater
than 3), now what is the probability of event
A?
Since B has occurred, the sample space

reduces to B: {1, 2, 3}.
Two of the three possibilities are odd (in

A), and the chances are still equal. So
P(A|B) = 2/3.
Once we know the roll is 3 or less, the
probability increases to 2/3 that it's odd.
158
159
Definition: The conditional probability of A
given B is
P(A|B) = P(A ∩ B) / P(B)
(undefined if P(B) = 0).
This is the probability, given that event B

has occurred, that event A has also
occurred.
Die:
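The fair die example can be checked by counting outcomes, using the standard ratio P(A|B) = P(A ∩ B) / P(B). This is a sketch for equally likely outcomes only; `prob` and `cond_prob` are hypothetical helper names.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}  # sample space of a fair die

def prob(event):
    """P(event) when all outcomes in S are equally likely."""
    return Fraction(len(event & S), len(S))

def cond_prob(a, b):
    """P(A|B) = P(A ∩ B) / P(B); raises if P(B) = 0 (undefined)."""
    if prob(b) == 0:
        raise ValueError("P(B) = 0, so P(A|B) is undefined")
    return prob(a & b) / prob(b)

A = {1, 3, 5}  # odd roll
B = {1, 2, 3}  # roll no greater than 3
print(cond_prob(A, B))
```

The result agrees with the reduced-sample-space argument: two of the three outcomes in B are odd.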
160
Example (continued from slide 155):
P(defective etching) = 0.12.

P(crack defect) = 0.29.
P(etching and crack defects) = 0.07.
If a chip has a crack defect, what is the

(conditional) probability that it also has
defective etching?
161
What is the probability that a chip has a crack

defect but satisfactory etching?
If a chip has a crack defect, what is the

probability that it has satisfactory etching?
Note: P(A|B) = 1 − P(A^c|B), just like
P(A) = 1 − P(A^c).
162
If a chip has defective etching, what is the

probability that it also has a crack defect?
In general, P(A|B) and P(B|A) are different quantities; knowing one alone does not give the other.
163
Independence
Definition: If P(A ∩ B) = P(A) P(B), we say
A and B are independent.
If A and B are independent, P(A) > 0,
P(B) > 0, then
P(A|B) = P(A ∩ B) / P(B) = P(A) P(B) / P(B) = P(A).
Likewise, P(B|A) = P(B). Your book uses

this as the definition of independence.
164
Assuming P(A) > 0, P(B) > 0, any one of
P(A ∩ B) = P(A) P(B)
P(A|B) = P(A)
P(B|A) = P(B)
implies independence and the other two.
165
Example: Draw one card at random from a

well-shuffled deck. Define:
A = {draw a club}
B = {draw an ace}
C = {draw a red card}
Are A and B independent? A and C?
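One way to check the card questions is to enumerate the deck and test the product rule P(A ∩ B) = P(A) P(B) directly. This is a sketch; the deck encoding and helper names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
DECK = list(product(RANKS, SUITS))  # 52 equally likely (rank, suit) cards

def prob(event):
    """P(event), where `event` is a function card -> bool."""
    return Fraction(sum(1 for card in DECK if event(card)), len(DECK))

def independent(e1, e2):
    """True when P(E1 and E2) equals P(E1) * P(E2)."""
    return prob(lambda c: e1(c) and e2(c)) == prob(e1) * prob(e2)

def is_club(card):
    return card[1] == "clubs"

def is_ace(card):
    return card[0] == "A"

def is_red(card):
    return card[1] in ("diamonds", "hearts")

print(independent(is_club, is_ace))  # A and B
print(independent(is_club, is_red))  # A and C
```

Clubs and red cards cannot occur together, so they illustrate that mutually exclusive events with positive probability fail the product rule.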
166
Note that events being mutually exclusive
and their being independent are not the
same thing.
Show: If P(A) > 0, P(B) > 0, and A and B

are mutually exclusive, they cannot be
independent!
167
We'll often assume independence to

calculate probabilities of intersections.
Example: Roll a red die and a black die.
A = {red 6}
B = {black 6}
P(A) = 1/6
P(B) = 1/6
(fair dice)
Results on one die shouldn't influence the
other, so we assume independence.
P(double-sixes) = P(A ∩ B) = P(A) P(B)

= (1/6)(1/6) = 1/36.
168
This extends to more than 2 events.
The multiplication law for independent

events says that if events A1, A2, , An
are independent (that is, knowledge of any
combination of the Ais does not change
the probabilities of the remainder), then
P(A1 A2 An) = P(A1) P(A2) P(An).
Note: this is the probability that all n

events occur.
169
Example: Flip a fair coin 4 times.
Let Ai = {Flip i is a head}.

P(Ai) = 1/2,
i = 1, 2, 3, 4
Separate flips are independent. (Why?)
P(4 heads) = P(A1 ∩ A2 ∩ A3 ∩ A4)
= P(A1) P(A2) P(A3) P(A4)
= (1/2) (1/2) (1/2) (1/2)
= 1/16.
170
Example: Draw a card from a standard

deck 3 times with replacement (replace
and reshuffle after each draw).
Let Ai = {Draw i is a spade}.

P(Ai) = 13/52 = 1/4, i = 1, 2, 3
Separate draws are independent. (Why?)
P(3 spades) = ?
What if events aren't independent?
Recall, P(A|B) = P(A ∩ B) / P(B).
Therefore, P(A ∩ B) = P(A|B) P(B).
The general multiplication law:
171
P(A1 and A2) = P(A1) P(A2|A1).

172
Example: Suppose we have 4 cards,

labeled 1, 2, 3, and 4. Suppose we
draw two at random without replacement.
What is the probability both cards are
odd?
173
Example: Suppose we draw two cards at

random without replacement from a
standard deck. What is the probability
both cards are spades?
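Both without-replacement examples apply the general multiplication law, P(A1 and A2) = P(A1) P(A2|A1), where the second factor reflects the card already removed. A minimal sketch; the function name `p_all_special` is hypothetical.

```python
from fractions import Fraction

def p_all_special(n_special, n_total, n_draws):
    """P(every card in n_draws draws without replacement is 'special').

    Multiplies P(A1) * P(A2|A1) * ...; after each special card is drawn,
    one fewer special card and one fewer card overall remain.
    """
    p = Fraction(1)
    for i in range(n_draws):
        p *= Fraction(n_special - i, n_total - i)
    return p

print(p_all_special(2, 4, 2))    # cards 1-4, two of them odd
print(p_all_special(13, 52, 2))  # standard deck, 13 spades
```

Had the draws been made with replacement, the factors would all be equal and the events independent.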
174
Random Variables (3.3)
Definition: A random variable is a random

number. It is obtained by assigning a
number to each outcome of an
experiment.
Example: Roll a die. The number rolled is

a random variable.
175
Example: Flip a coin 5 times. Is the

sequence of heads and tails a random
variable (Example: HHTHT)?
Some random variables we could

generate from 5 coin flips:
X = # H
Y = # H − # T
Z = # H before first T
We usually denote random variables by

capital letters from the end of the alphabet.
176
Example: Select a rat at random from a

large colony. What are some possible
random variables?
177
There are two main types of random

variables: discrete and continuous.
Definition: A discrete random variable can

only take on a specified (countable) list of
values. There is a gap between any two
elements in its sample space.
In practice, these are usually counts of some

sort, and thus whole numbers.
Example: Number of heads in 5 coin flips.

178
Definition: A continuous random variable

may take any real number in some (set of)
interval(s).
Examples: Weight, lifetime.
We will need to deal differently with

discrete and continuous random variables.
179
Discrete Random Variables

Definition: The probability mass function
(p.m.f.) of a discrete random variable X is
a function p() from the support of X to the
real numbers, where
p(x) = P(X = x) .
Notation:
X: capital letter, indicates a random variable.

x: lowercase letter, indicates a specific value.
Example: Let X be the roll of a fair die.
180
S = {1, 2, 3, 4, 5, 6}
p(1) = P(X = 1) = 1/6
p(2) = P(X = 2) = 1/6
and so on.
We might write
p(x) = 1/6,  x ∈ {1, 2, 3, 4, 5, 6}
181
Example: An industrial plant has 3

machines. The probability that X are
operating at a given random time may be
found from
x      0     1     2     3
p(x)   0.12  0.27  0.46  0.15
The laws of probability tell us that:
1) ? ≤ p(x) ≤ ? for all x
2) Σ_{x ∈ S} p(x) = ?
182
183
A p.m.f. is plotted as spikes:
184
Or as a probability histogram, with areas

equal to probabilities:
185
186
187
Continuous Random Variables
Recall, a continuous random variable may

take any value in some real interval.
Continuous random variables are typically

measurements (length, weight, lifetime,
etc.).
With continuous random variables, we

can't use a p.m.f. to find probabilities.
Instead:
Definition: A probability density function

(density, p.d.f.), f(x), is a function which
determines the probability properties of a
continuous random variable. If X ~ f(x),
then
P(a ≤ X ≤ b) = ∫_a^b f(x) dx.
188
189
If f(x) is a p.d.f.:
f(x) ?
for all x, and
Note: for a continuous random variable,

Why?
190
Example: a continuous random variable

has p.d.f.
Is f(x) a true p.d.f.?
191
Example (continued): What is the

probability that X will be between 0.5 and
1.0?
P(2.5 ≤ X ≤ 3.0) = ?
P(0.2 ≤ X ≤ 0.2) = ?
P(X < 1.0) = ?
192
193
Definition: The cumulative distribution

function (c.d.f.), F(x), of a random variable
is defined as
F(x) = P(X ≤ x).
If X is continuous, F(x) = ∫_{−∞}^{x} f(t) dt.
194
Properties of continuous c.d.f.'s:
1) lim_{x→−∞} F(x) = 0
2) lim_{x→∞} F(x) = 1
3) F is nondecreasing (if x < y, F(x) ≤ F(y)).
4) P(a ≤ X ≤ b) = P(X ≤ b) − P(X ≤ a)
= F(b) − F(a).
This is often easier than integrating f(x).
Example (back to earlier p.d.f.):
P(0.5 ≤ X ≤ 1.0) = ?
195
(Compare to slide 192.)

196
The Population Mean

Definition: The population mean (expectation,
expected value) of random variable X is
μ = E(X) = Σ x p(x)   (sum over all x in S)
if X is discrete, and
μ = E(X) = ∫ x f(x) dx   (integral over all x)
if X is continuous.
It can be thought of as the long-term average
of X, or the mean of a sample that follows the
distribution of X perfectly.
197
Example: Die roll
p(x) = 1/6,  x ∈ {1, 2, …, 6}
μ = ?
Example: Machines
x      0     1     2     3
p(x)   0.12  0.27  0.46  0.15
μ = ?
198
199
Example:
=?
Example:
=?
Expectations of Functions of
Random Variables
Given a random variable, X, suppose we

are really interested in a function, h(X).
The expected value of h(X) is
E[h(X)] = Σ h(x) p(x)
if X is discrete, and
E[h(X)] = ∫ h(x) f(x) dx
if X is continuous.
Example: X ~ p(x) = , x = 1, 2.
What is E(X²)?
Note: In general, E[h(X)] ≠ h[E(X)].
Example: For the above p.m.f., what is
E(X)? [E(X)]²?
Is E(X²) = [E(X)]²?
200
201
The Population Variance and

Standard Deviation
Just as we have a population mean to
measure the center of a distribution, the
population variance and standard
deviation measure a distribution's spread.
202
Definition: Let X be a random variable with
mean μ. Then the population variance of
X, σ², is
σ² = V(X) = E[(X − μ)²] = E(X²) − μ².
Definition: The population standard
deviation, σ, of random variable X is the
square root of the variance of X.
203
Example: Die roll
p(x) = 1/6,  x ∈ {1, 2, …, 6}
μ = ?
E(X²) = ?
V(X) = ?
σ = ?
Example: p(x) = 1/2,  x ∈ {3, 4}
μ = ?
E(X²) = ?
V(X) = ?
σ = ?
204
Example: Machines
x      0     1     2     3
p(x)   0.12  0.27  0.46  0.15
μ = ?
E(X²) = ?
V(X) = ?
σ = ?
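For a discrete random variable, the mean and variance are weighted sums over the p.m.f. The sketch below uses the standard shortcut V(X) = E(X²) − [E(X)]²; storing the p.m.f. as a dict and the helper names are assumptions made for illustration.

```python
import math
from fractions import Fraction

def pmf_mean(pmf):
    """E(X) = sum of x * p(x) over the support."""
    return sum(x * p for x, p in pmf.items())

def pmf_variance(pmf):
    """V(X) = E(X^2) - [E(X)]^2."""
    mu = pmf_mean(pmf)
    ex2 = sum(x**2 * p for x, p in pmf.items())
    return ex2 - mu**2

machines = {0: 0.12, 1: 0.27, 2: 0.46, 3: 0.15}
die = {x: Fraction(1, 6) for x in range(1, 7)}

print(pmf_mean(machines), pmf_variance(machines),
      math.sqrt(pmf_variance(machines)))
print(pmf_mean(die), pmf_variance(die))
```

Using `Fraction` for the die keeps the answers exact, while the machines table is handled in floating point.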
205
Example:
μ = ?
E(X²) = ?
V(X) = ?
σ = ?
206
Linear Functions of Random

Variables (3.4)
Recall, a linear function (or linear

combination) of variables x1, x2, …, xn, is
a function of the form
f(x1, x2, …, xn) = a1x1 + a2x2 + … + anxn + b
where b and all of the ai's are fixed
constants.
207
Given any random variables X1, X2, …, Xn
and known constants a1, a2, …, an, and b,
then
E(a1X1 + a2X2 + … + anXn + b) =
a1E(X1) + a2E(X2) + … + anE(Xn) + b .
To find the expectation of a linear

combination of random variables, we need
only know the constants and the
expectation of each random variable
individually.
208
Example: Let X be a random temperature

measured in degrees Celsius, with E(X) =
10. Let Y be the same temperature in
degrees Fahrenheit, Y = 9/5 X + 32. What
is E(Y)?
Example: The expectation of the roll of a

fair die is 3.5. What is the expectation of
the sum of four such rolls?
209
Independent Random Variables
Recall, events are said to be independent

if knowledge of one does not affect the
probability of the other.
Likewise, random variables X and Y are

independent if knowing the value of X
does not affect probabilities of Y, no
matter what value X takes (and vice versa).
210
If X and Y are independent, any event

involving X alone will be independent from
any event involving Y alone.
P(X ∈ A and Y ∈ B) = P(X ∈ A) P(Y ∈ B)
for any A and B.
Draws with replacement are independent.
Draws in a simple random sample are not

independent, but may be treated as
though they are if the sample size is much
smaller than the population size.
211
If the random variables are independent,

then
V(a1X1 + a2X2 + … + anXn + b) =
a1²V(X1) + a2²V(X2) + … + an²V(Xn) .
Notes:
The shift b does not affect the variance.

The coefficients ai are squared.
Dependent random variables require a more
complex formula.
212
Example: Let the variance of the Celsius

temperature X be V(X) = 25.
What is the standard deviation of X?
What is the variance of Y = 9/5 X + 32?
What is the standard deviation of Y?
213
Example: The variance of the roll of a fair

die is 35/12. What is the variance of the
sum of four such rolls?
If we take a single roll and multiply it by 4,

what is the variance of the result? Why is
this different?
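The difference between summing four independent rolls and multiplying one roll by 4 can be confirmed by brute-force enumeration. A sketch under the assumption of fair, independent dice; the helper `variance` is hypothetical.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)

def variance(values):
    """Population variance of a list of equally likely values."""
    values = list(values)
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((v - mean) ** 2 for v in values) / n

# Sum of four independent rolls: all 6^4 outcomes are equally likely.
var_sum = variance(sum(rolls) for rolls in product(FACES, repeat=4))

# One roll multiplied by 4: only six equally likely outcomes.
var_scaled = variance(4 * x for x in FACES)

print(var_sum, var_scaled)
```

The sum gives 4 · V(X), while the scaled roll gives 4² · V(X): independent rolls partly cancel each other's deviations, but a single multiplied roll cannot.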
214
Suppose X and Y each have mean 10 and

variance 4. What are the mean and
variance of Z = X − Y?
215
Mean and Variance of

the Sample Mean
An important special case concerns the

sample mean of the Xi's,
X̄ = (X1 + X2 + … + Xn)/n.
Note that X̄ is a linear combination of the
Xi's.
216
Theorem: If X1, X2, …, Xn are independent
random variables, each with E(Xi) = μ and
V(Xi) = σ², then
E(X̄) = μ
and
V(X̄) = σ²/n.
Proof:
217
Example: A (possibly biased) coin has

probability p of coming up heads. We flip
it and let X = 1 if heads, 0 if tails.
What are E(X) and V(X)?
Suppose we flip it n times, and look at
218
Chapter 4: Common Distributions
Often we will have useful mathematical forms

which represent entire families of
distributions.
These distributions include one or more

constants (called parameters) which must be
specified to define a specific distribution.
We will concentrate on two especially

important families, the binomial and normal
distributions.
219
The Binomial Distribution (4.1)
The binomial distribution is the most

important common named family of
discrete distributions.
Recall, a discrete distribution is described

by a probability mass function p(), where
p(0) = P(X = 0)
p(1) = P(X = 1)
and so on.
220
Suppose our experiment consists of trials

with only two possible outcomes.
One outcome, called a "success," occurs
with probability p.
The other outcome is called a "failure," and
occurs with probability (1 − p).
Such a process is called a Bernoulli trial

(after 17th-century probabilist James
Bernoulli).
The binomial distribution looks at a fixed

number of independent identical Bernoulli
trials, and counts the number of successes.
221
Example: Suppose silicon computer chips

are made in pairs, and that 30% of all
chips produced are defective.
Also assume that the chips in a pair are

independent of each other.
Out of pairs in which the first chip is good,

the second is defective in 30% of pairs.
This remains true for pairs in which the
first chip is defective.
222
Out of all pairs, 70% will have a good first

chip. Out of those, 70% will also have a
good second chip. Overall, 70% of 70%, or
49% (.7*.7 = .49) will have two good chips.
Likewise, 30% of that 70%, or 21% overall

(.7*.3 = .21) will have a good first chip and a
defective second chip.
By the same reasoning, 30% will have a

defective first chip, and 70% of those (21%
overall) will have a good second chip.
Finally, 30% of 30%, or 9% will have both

chips defective.
223
If we let the letter S (for success)

represent a good chip, and F (for failure)
represent a defective one, we can
summarize as:
P(SS) = .7*.7 = .49

P(SF) = .7*.3 = .21
P(FS) = .3*.7 = .21
P(FF) = .3*.3 = .09
Now let X be the number of good chips

produced in a pair.
Then X can take the values 0, 1, or 2.
224
From the above,
p(0) = P(X = 0) = P(FF) = .09

p(2) = P(X = 2) = P(SS) = .49
p(1) = P(X = 1) = P(SF or FS) = .21 + .21
= .42
225
What if the chips are produced in sets of

4?
If we want the probability of a set

consisting of 2 good and 2 defective chips,
we can think about the case of SSFF: the
first and second chips are good, while the
third and fourth are defective.
The probability of this particular outcome

will be .7*.7*.3*.3 = .0441 or 4.41%.
226
But there are other ways we can have two
successes and two failures: 5 other
ways, in this case:
P(SSFF) = .7*.7*.3*.3 = .0441

P(SFSF) = .7*.3*.7*.3 = .0441
P(SFFS) = .7*.3*.3*.7 = .0441
P(FSSF) = .3*.7*.7*.3 = .0441
P(FSFS) = .3*.7*.3*.7 = .0441
P(FFSS) = .3*.3*.7*.7 = .0441
Overall, p(2) = P(X = 2) = 6*.0441

=.2646.
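The counting argument above can be reproduced by listing every success/failure sequence and adding up the probabilities that share the same number of successes. A sketch only; `pmf_by_enumeration` is a hypothetical helper, and independence between chips is assumed as in the example.

```python
from itertools import product

def pmf_by_enumeration(n, p):
    """p.m.f. of the number of successes in n independent trials.

    Lists all 2^n sequences of 'S'/'F' and accumulates each sequence's
    probability under its success count.
    """
    pmf = [0.0] * (n + 1)
    for seq in product("SF", repeat=n):
        k = seq.count("S")
        pmf[k] += p**k * (1 - p) ** (n - k)
    return pmf

pairs = pmf_by_enumeration(2, 0.7)  # chips made in pairs
sets4 = pmf_by_enumeration(4, 0.7)  # chips made in sets of 4
print([round(v, 4) for v in pairs])
print([round(v, 4) for v in sets4])
```

Enumeration is only practical for small n, which motivates a general formula for the number of arrangements.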
227
In general, suppose we have an

experiment consisting of n independent
Bernoulli trials.
Those trials which satisfy the condition we

wish to count are called successes, and
occur with probability p.
The remaining trials are called failures;

these occur with probability (1 − p).
Let X be the number of successes in the

full experiment.
228
If these conditions are true, we say that X,

the number of successes in the
experiment, has a binomial distribution
with parameters n and p.
X ~ Binomial(n, p) or X ~ Bin(n, p).
The mass function for X is:
p(x) = [n! / (x! (n − x)!)] p^x (1 − p)^(n − x),  x = 0, 1, …, n.
229
Note: the exclamation mark is pronounced
"factorial."
Given n items, n! is the number of
arrangements, and is found as
n! = n × (n − 1) × (n − 2) × ⋯ × 2 × 1.
Since there is one (empty) way to arrange

0 objects, we define 0! = 1.
230
Example: The chips (30% defective) are

produced in batches of 4. Let X be the
number of good chips in a batch.
What distribution does X follow?
What is p(2)?
What is the probability that a random batch

will contain no more than one good chip?
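The batch questions can be answered with the standard binomial mass function, p(x) = C(n, x) p^x (1 − p)^(n − x), where C(n, x) = n! / (x! (n − x)!). A minimal sketch using Python's `math.comb`; the function names are illustrative.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Bin(n, p), summing the p.m.f. from 0 to x."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

n, p = 4, 0.7  # batch of 4 chips, each good with probability 0.7
print(binom_pmf(2, n, p))  # exactly two good chips
print(binom_cdf(1, n, p))  # no more than one good chip
```

Here "success" is a good chip, so p = 0.7 rather than the 30% defect rate.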
231
Example: In a genetics study, a second-generation
cross of pure green peas with
pure yellow peas leads to pods where p =
P(yellow) = 3/4.
If pods contain 8 seeds, what is the

probability that a random pod will contain 6
yellow seeds?
What is the probability that a random pod

will contain at least 6 yellow seeds?
232
Table A.1 in your book can save

calculations by providing probabilities of
P(X ≤ x) for n ≤ 20 and certain values of p.
Example: Draw 16 times with replacement

from a standard deck, and let X = number
of spades drawn.
Find P(X > 6).
233
With standard distributions, the mean and

variance may generally be found as a
function of the parameters.
If X ~ Binomial(n, p), then μ = np.
Example: If 75% of all seeds are yellow, and

each pod contains 8 seeds, what is the mean
number of yellow seeds per pod?
Example: If we have 4 fair coins which we flip
as a batch, what is the mean number of
heads?
234
Additionally, if X ~ Bin(n, p), then
σ² = np(1 − p).
Example: X = # yellow seeds ~ Bin(8, .75).

What are the variance and standard deviation
of X?
Example: X = # heads in 4 flips ~ Bin(4, .5).

What are the variance and standard deviation
of X?
235
Recall, draws without replacement (simple

random samples) are not independent.
However, we may do calculations as

though they are independent (including
binomial calculations) as long as the
sample size is small (less than 5%)
compared to the population size.
236
Example: A lot of several thousand

components contains 7% defective. We
sample 8 at random.
What is the probability of no defective

components in our sample?
What is the probability of at least one

defective?
What is the expected number of defectives

in our sample?
237
The Normal Distribution (4.3)
The continuous normal (or Gaussian)

distribution has two parameters, μ and σ².
If X ~ N(μ, σ²),
f(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)).
This distribution is often seen in practice,

and is also very important theoretically.
238
The normal p.d.f. is a bell-shaped curve,
symmetric around, and with its peak at, μ.
E(X) = μ.
Its width is determined by σ²; large values
of σ² imply a wide, low curve, while small
values imply a narrow, tall one.
V(X) = σ².
239
An important special case is the standard

normal distribution, with μ = 0 and σ² = 1.
We usually identify standard normal

variables with the letter Z.
If Z is standard normal, Z ~ N(0,1) and the
density of Z is
f(z) = (1/√(2π)) exp(−z²/2).
240
There is no closed-form integral for the

normal probability density function, so we
can't find probabilities that way.
To find normal probabilities, we must use

computer programs (which themselves
use numeric integration), or tables such as
Table A.2 (p. 521-522, and inside the front
cover of your book) of the standard normal
distribution.
241
242
Examples:
P(Z ≤ 1.00) = ?
P(Z > 1.00) = ?
P(−2.00 ≤ Z ≤ 0.75) = ?
243
For X ~ N(μ, σ²), we find proportions by
converting to standard units.
If X ~ N(μ, σ²), then Z = (X − μ)/σ ~ N(0,1).
Remember to convert both sides of any

inequality the same way.
244
Examples: Let X ~ N(3, 4).
P(X ≤ 6.00) = ?
P(X > 4.00) = ?
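As an alternative to the table, these probabilities can be computed in software. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8+) and also shows the conversion to standard units; note that N(3, 4) lists the variance, so the standard deviation is 2.

```python
from statistics import NormalDist

X = NormalDist(mu=3, sigma=2)  # N(3, 4): variance 4, so sigma = 2
Z = NormalDist()               # standard normal, N(0, 1)

# Direct computation from the distribution of X.
p_le_6 = X.cdf(6.0)
p_gt_4 = 1 - X.cdf(4.0)

# Same first answer via standard units: z = (x - mu) / sigma.
z = (6.0 - 3) / 2
p_le_6_std = Z.cdf(z)

print(round(p_le_6, 4), round(p_le_6_std, 4), round(p_gt_4, 4))
```

Both routes should agree, since standardizing is exactly what a table lookup does by hand.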
245
Normal Percentiles
Just as for samples, the pth percentile of a

distribution has p% of the probability below
it, and (100 − p)% above.
We find percentiles for the normal

distribution using Table A.2 again, but
reading from the inside out.
Since probabilities are in the middle of the

table, start there.
Read to the outside to find the percentile.
246
Example: Z ~ N(0, 1). What is the 70th

percentile of Z?
Example: What is the 25th percentile of Z?
247
For non-standard normal variables, first

find the desired percentile for the standard
normal, then use the fact that since
Z = (X − μ)/σ, therefore X = μ + σZ.
Example: X ~ N(10, 25). What is the 95th

percentile of X?
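Percentiles can also be found with an inverse c.d.f. instead of reading the table from the inside out. A sketch using `statistics.NormalDist.inv_cdf`; N(10, 25) is read here as variance 25, so σ = 5.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

z70 = Z.inv_cdf(0.70)  # 70th percentile of Z
z25 = Z.inv_cdf(0.25)  # 25th percentile of Z (negative, below the mean)
z95 = Z.inv_cdf(0.95)  # 95th percentile of Z

# Convert back from standard units: X = mu + sigma * Z.
mu, sigma = 10, 5
x95 = mu + sigma * z95

print(round(z70, 4), round(z25, 4), round(z95, 4), round(x95, 2))
```

The same value for x95 comes from `NormalDist(10, 5).inv_cdf(0.95)`, which performs the conversion internally.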
248
Besides the binomial and normal

distributions, there are a number of other
named families of distributions with useful
properties.
For example, the Poisson distribution

(Section 4.2) is useful for modeling random
counts in a fixed interval of time or space.
See Sections 4.4-4.6 for discussion of the

lognormal, exponential, gamma, and Weibull
distributions, which are useful for modeling
continuous histograms which are positively
skewed and unimodal.
249
Sampling Distributions (4.8)
Suppose random variable X is drawn from

some distribution f. (X ~ f )
Now suppose we generate n of these

random variables, X1, …, Xn, independently
from f.
We say that X1, …, Xn make a random sample
from f.
Sometimes we say that X1, …, Xn are i.i.d.
(independent and identically distributed) from
f.
250
Since the X's make a sample, we can
compute sample statistics such as the
mean, X̄.
Recall (3.4), since the X's are random, so
is X̄, and since it is a number, X̄ is itself a
random variable with a distribution.
This distribution is referred to as the
sampling distribution of X̄ and plays a
large role in inferential statistics.
Example: Let pX(x) = 1/3, x = 1, 2, 3, and

let X1 and X2 be independent draws from
pX(x).
Now let X̄ = (X1 + X2)/2 be the average of
X1 and X2.
Note that X̄ is also a discrete random
variable, and therefore has a probability
mass function.
What is the mass function (sampling
distribution) of X̄?
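For a small discrete population, the sampling distribution of the average can be built by listing every pair of draws. A sketch assuming two independent draws, each equally likely to be 1, 2, or 3; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

values = [1, 2, 3]
p = Fraction(1, 3)  # each value has probability 1/3

sampling_pmf = Counter()
for x1, x2 in product(values, repeat=2):
    xbar = Fraction(x1 + x2, 2)
    # Independence: P(X1 = x1 and X2 = x2) = p * p.
    sampling_pmf[xbar] += p * p

for xbar in sorted(sampling_pmf):
    print(float(xbar), sampling_pmf[xbar])
```

The averages pile up in the middle because more pairs produce a central value than an extreme one.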
251
252
Example: Suppose X ~ N(50, 4). A

histogram of 1000 X's looks like this:
253
Sample 25 X's and compute X̄.
If we repeat this process 1000 times, we

get a histogram such as this:
Note that X̄ has a distribution that:
Is centered on 50 (μ);
Is narrower than the solid normal curve for the
individual X's: the variance and standard
deviation of X̄ are smaller than those of X.
Remains bell-shaped and (roughly?) normal.
254
Understanding the distributions of sample

statistics and their relationships to the
associated population parameters is the
basis of most of inferential statistics.
255
In general, if a sample statistic is used to

estimate a population parameter:
The sampling distribution of the statistic is

centered on (or at least near) the parameter.
The spread of the sampling distribution will
decrease as the sample size gets larger.
As the sample size gets larger, the shape of
the sampling distribution will usually get more
and more bell-shaped (normal).
256
Sampling Distributions of the Mean

Let X̄ be the sample mean of a random
sample X1, X2, …, Xn, from a population or
process with mean μ and standard
deviation σ. Then (recall, Section 3.4):
The mean of the sampling distribution of X̄,
μ_X̄, is μ, the population mean, regardless of
sample size n.
The standard deviation of the sampling
distribution of X̄, σ_X̄, is σ/√n, the population
standard deviation divided by the square root
of the sample size.
257
The standard deviation of the sample
mean, σ_X̄, is often called the standard
error of the sample mean.
This emphasizes that it describes a

sampling distribution, not a population.
258
As the sample size gets larger, we have

more information and can make better
estimates, so the standard error
decreases.
(Note, however, that the square root

means we have diminishing returns; each
new observation provides less new
information than the previous one.)
The larger the sample, the closer X̄ is
likely to be to μ.
259
260
If our original population has a normal

distribution, the sampling distribution of X̄
is also normal, regardless of sample size.
Example: An automated filling machine
fills soft drink cans with a volume that has
a normal distribution with σ = 0.05 ounces.
If we sample 4 cans and take the sample
mean, what is the probability that X̄ will be
within 0.04 ounces of the population mean
μ?
261
The Central Limit Theorem
The Central Limit Theorem is the most

important theorem in statistics.
It shows the importance of the normal

distribution, and provides the justification
of many of the most fundamental statistical
methods.
262
If we know that a population or process

has a normal distribution, we know that the
sampling distribution of X̄ will also be
normal. This allows us to compute useful
probabilities.
Unfortunately, we often do not know the

population distribution (or perhaps we
know that it is not normal).
Fortunately, this is not always required.

263
The sample mean (or sum) of a large

number of independent random variables
has a sampling distribution which is
approximately normal, no matter what
distribution the original random variables
come from.
This important result is the Central Limit

Theorem.
Theorem (Central Limit Theorem): If X1,
X2, …, Xn are independent random
variables, from a population or process
with mean μ and standard deviation σ,
then as long as n is sufficiently large,
X̄ ~ N(μ, σ²/n) approximately, and
X1 + X2 + … + Xn ~ N(nμ, nσ²) approximately.
We can use this to find probabilities for
sums or averages, without knowing the
distribution of the Xi's!
264
265
266
Example: The (population) mean time
required for maintenance on an air-conditioning
unit is 1 hour, and the
standard deviation is also 1 hour. A
company operates 50 such units.
Could we find the probability that the

maintenance on a single unit requires more
than 2 hours from the information given?
What is the probability that the average time

for maintenance will be more than 75
minutes?
What is the probability that the total time for

maintenance will be less than 40 hours?
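The last two questions can be approximated with the Central Limit Theorem: the average of n units is treated as roughly normal with mean μ and standard deviation σ/√n, and the total as roughly normal with mean nμ and standard deviation σ√n. A sketch using `statistics.NormalDist`; treating n = 50 as large enough is an assumption, not a guarantee.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 1.0, 1.0, 50  # hours per unit, 50 units

# CLT approximations for the sample mean and the total.
xbar_dist = NormalDist(mu, sigma / sqrt(n))
total_dist = NormalDist(n * mu, sigma * sqrt(n))

p_avg_over_75min = 1 - xbar_dist.cdf(1.25)  # 75 minutes = 1.25 hours
p_total_under_40h = total_dist.cdf(40)

print(round(p_avg_over_75min, 4), round(p_total_under_40h, 4))
```

The single-unit question cannot be handled this way, because the theorem describes sums and averages, not the distribution of one observation.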
267
268
How large is "large"?
As a general rule, n ≥ 30 is usually large

enough that the Central Limit Theorem is
reasonable.
Symmetric populations can get by with

much less, often as few as 10, or even
fewer.
Highly skewed populations require more.

50 or more should be fairly safe in all but
the worst cases.
269
The Normal Approximation

to the Binomial Distribution
Recall, if X ~ B(n, p), then E(X) = np and

V(X) = np(1-p).
If the particular values of n and p lead to a
binomial distribution which is not very
skewed, the N(np, np(1 − p))
distribution can be a good approximation to
the B(n,p) distribution.
We usually require that np ≥ 10 and
n(1 − p) ≥ 10.
270
Example: Roll a die 120 times and count

the number of 6s rolled (X).
What distribution does X follow?
What are E(X) and V(X)?
What is P(X ≥ 25)?
271
The true binomial probability is 0.136.
We're pretty close, but we can do better.
Binomial probabilities are located entirely

on the integers, but normal probabilities
are smeared out over the whole real line
(remember the probability histogram).
We'll get a better approximation if we use

a continuity correction, by taking the
normal probability from (x - .5) to (x + .5)
to approximate the binomial P(X = x).
272
273
So, for X ~ B(120, 1/6),

P(X ≥ 25) = P(X ≥ 24.5) =
Example: If X~Bin(120, 1/6), use the

normal approximation to estimate
P(15 < X < 25).
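The effect of the continuity correction can be seen by comparing the exact binomial tail with the normal approximation taken at 25 and at 24.5. A sketch for X ~ Bin(120, 1/6) and the event X ≥ 25; the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 120, 1 / 6
approx = NormalDist(n * p, sqrt(n * p * (1 - p)))  # N(np, np(1-p))

# Exact binomial tail: sum the p.m.f. from 25 through n.
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(25, n + 1))

plain = 1 - approx.cdf(25)        # no continuity correction
corrected = 1 - approx.cdf(24.5)  # with continuity correction

print(round(exact, 4), round(plain, 4), round(corrected, 4))
```

Starting the normal area at 24.5 includes the whole histogram bar for X = 25, which is why the corrected value lands closer to the exact one.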
274
Chapter 5: Statistical Estimation
The remainder of the course will focus on

inferential statistics.
Recall, in probability, we generally know

the distribution in question and wish to
calculate something about particular
outcomes or events.
275
In inferential statistics, we have a sample,

and wish to use that information to say
something about the population or
distribution the sample was drawn from.
Probability
Population
Sample
Inferential
Statistics
276
Recall: A parameter is an unknown

quantity related to a population or
distribution.
A statistic is a known quantity which can

be calculated from a dataset.
Estimation uses a statistic (what we know)

to tell us something about an unknown
parameter (what we wish we knew).
277
Point Estimation (5.1)
Definition: A point estimate of a parameter
θ is a statistic, θ̂, which represents a
best guess for θ.
Example: We have an unknown
distribution, X ~ f(x), and we wish to know
the unknown parameter μ = E(X). We
take a sample X1, X2, …, Xn, and estimate
μ with the known statistic X̄.
278
Other common point estimates:
Estimate V(X) = σ² with s² = Σ(Xi − X̄)²/(n − 1).
If X ~ Binomial(n, p) (n known, p
unknown), estimate p with p̂ = X/n.
All of our standard sample statistics

(median, quartiles, etc.) are good
estimates of the corresponding population
or distribution parameters.
279
Properties of Estimates
There are a few properties that we like to

see in a parameter estimate.
On average (over many samples), an

estimate should give the correct value for
the parameter. If the mean of the
sampling distribution of our estimate is the
parameter we are estimating, that is,
E(θ̂) = θ, we say that θ̂ is an unbiased
estimate of θ.
280
Example: We know that E(X̄) = μ, so X̄ is an
unbiased estimate of μ.
Also, E(s²) = σ² and E(p̂) = p (proof:)
so the sample variance and proportion
are unbiased estimates of the population
variance and proportion.
This is why we divide by (n − 1) instead of
n to find s².
On the other hand, the sample standard
deviation, s, has E(s) ≠ σ, so s is a biased
estimate for σ.
Fortunately, the bias (defined as E(s) − σ,
or more generally, E(θ̂) − θ) is small,
especially as n gets large.
281
282
Note that just because an estimate is

unbiased, does not guarantee that it will
give you the exact parameter on this (or
possibly, any) sample.
Example: X ~ Binomial(n = 25, p = 0.3).

Even though p̂ is unbiased for p, there is
no value of X that will give p̂ = 0.3.
Remember our sampling distributions; an
unbiased estimate's distribution will be
centered correctly, but it will still have
some spread.
283
The variance of the sampling distribution of our

estimate measures that spread and is also
important in measuring how well it performs.
We combine these two aspects into a
single measure, the mean squared error:
MSE(θ̂) = E[(θ̂ − θ)²] = V(θ̂) + [bias(θ̂)]².
A small MSE means that both bias and

variance are small.
Example: Suppose X1 and X2 are

independent, with E(X1) = E(X2) = μ and
V(X1) = V(X2) = σ².
Let
284
285
Find:
286
Example (continued): Let
Find:
For what values of μ and σ² is
287
Confidence Intervals (5.2)
Having a good estimate is a good first step

in learning about a population parameter.
We should also be interested in how close

our estimate is likely to be to the
parameter.
One approach is to calculate the standard

error, remembering that we will usually be
within 2-3 standard errors of the parameter
(if we use an unbiased estimate).
288
Another way to look at this issue is that we
know our estimate is incorrect. (We just
don't know by exactly how much.)
We can improve this situation by
expanding our point estimate to an interval
estimate, providing a range of plausible
values for θ.
Done carefully, we can identify how likely it
is that our interval includes θ.
289
If our sample size, n, is large, we can use

the Central Limit Theorem to give us the
following.
290
Therefore, the interval
X̄ ± 1.96 s/√n
is a random interval which covers the
population mean μ with probability 0.95.
We call such an interval a 95% confidence
interval.
This represents a set of plausible values of
μ that are consistent with the data.
Example: In a random sample of 80 auto
body shops, the costs to repair a particular
kind of damage have mean $472.36 and
standard deviation $62.35.
What is the 95% confidence interval for

the mean of this population?
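A large-sample interval of the form X̄ ± z · s/√n can be computed as follows. This is a sketch: the function name `z_confidence_interval` is hypothetical, and the critical value comes from `statistics.NormalDist.inv_cdf` rather than a table, so it is about 1.96 for the 95% level.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(xbar, s, n, level=0.95):
    """Large-sample interval xbar +/- z * s / sqrt(n)."""
    # Critical value leaving (1 - level)/2 probability in each tail.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * s / sqrt(n)
    return xbar - half_width, xbar + half_width

low, high = z_confidence_interval(472.36, 62.35, 80)
print(round(low, 2), round(high, 2))
```

Changing `level` to 0.90 or 0.99 narrows or widens the interval accordingly.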
291
292
Is it correct to say
P(458.70 ≤ μ ≤ 486.02) = 0.95 ?
No! Nothing inside the probability

statement is random. Recall:
The random parts are the sample

statistics.
The interval is random, not the population
parameter, μ.
293
If we constructed many 95% confidence
intervals from independent datasets, we'd
get many different sample means and
sample standard deviations, and each
would lead to a different confidence
interval.
In the long run, about 95% of these
different confidence intervals would
contain the true parameter μ.
Remember, randomness is in the sample

and the interval, not in the parameter!
294
295
We call the value 95% the confidence

level. We say we are 95% confident that
the population mean lies within the
computed interval.
We can select other confidence levels if

desired, by replacing the critical value 1.96
with the Z-percentile that gives the
appropriate center probability.
A confidence level of 95% (1.96) is most

common, but levels of 90% (1.645) and
99% (2.575) are also often used.
296
In general, define z_p to be the value
above which there is probability p in the
tail of the standard normal distribution.
Then z_p will be the 100(1 − p)th percentile of
the standard normal distribution.
For a 100(1 − α)% confidence interval, we
use the critical value z_{α/2}.
Example: What critical value would we use

for an 80% confidence interval?
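Critical values for any two-sided confidence level can be obtained from the inverse normal c.d.f. A minimal sketch; `critical_value` is a hypothetical helper built on `statistics.NormalDist.inv_cdf`.

```python
from statistics import NormalDist

def critical_value(level):
    """Two-sided critical value z_(alpha/2) for confidence level 1 - alpha."""
    alpha = 1 - level
    # z_(alpha/2) is the 100(1 - alpha/2)th percentile of N(0, 1).
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.80, 0.90, 0.95, 0.99):
    print(level, round(critical_value(level), 3))
```

Software values may differ from table values in the third decimal place, since tables are read to the nearest listed entry.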
297
298
What factors affect the length (precision)

of the confidence interval?
s: If s is bigger, X̄ is less accurate, and the
interval must be wider.
Confidence level: To be more confident of
including the true value, we must make the
interval wider.
n: as n gets bigger, the standard error of X̄
gets smaller, and the interval gets narrower.
If we require a 95% confidence interval with error
width (interval half-width) no more than w, we
can compute a (rough) minimum sample size if
we have an estimate or upper bound for s:
$n \ge (1.96\, s / w)^2$
Of course, we can substitute the appropriate z
critical value to find sample sizes for other
confidence levels.
Example: Milk fill weights.
$n = 50$, $\bar{x} = 2.0727$, $s = 0.0711$
Find a 95% confidence interval for $\mu$.
w = ?
If we require $w \le 0.01$, how big should n be?
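The milk questions reduce to two short calculations. This sketch assumes the large-sample z interval and solves the half-width condition $1.96\, s/\sqrt{n} \le w$ for n, treating s = 0.0711 as a reasonable estimate for the new sample as well.

```python
import math

n, xbar, s = 50, 2.0727, 0.0711
z_crit = 1.96  # 95% confidence

# Half-width and interval for the current sample.
w = z_crit * s / math.sqrt(n)
ci = (xbar - w, xbar + w)

# Smallest n whose half-width is at most the target (rounded up).
w_target = 0.01
n_required = math.ceil((z_crit * s / w_target) ** 2)

print(f"w = {w:.4f}, CI = ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"n needed for w <= {w_target}: {n_required}")
```

Halving the half-width requires roughly four times the sample size, which is why the required n is so much larger than 50.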
Confidence Bounds
Sometimes, we only wish to know a lower
(or upper) bound on $\mu$.
We can generate one-sided confidence

intervals, also called confidence bounds,
in a similar way to the usual two-sided
case.
If we have a large sample, then:
A 95% lower confidence bound for $\mu$ is $\bar{x} - 1.645\, s/\sqrt{n}$.
A 95% upper confidence bound for $\mu$ is $\bar{x} + 1.645\, s/\sqrt{n}$.
To get 90%, 99%, or $100(1-\alpha)\%$ bounds,
replace 1.645 with 1.28, 2.33, or $z_\alpha$,
respectively.
Example: A sample of 48 shear strength
measurements gives a mean of
17.17 N/mm$^2$ and a standard deviation of
3.28 N/mm$^2$.
If we only care that the population mean
shear strength is great enough, find a 90%
lower bound on $\mu$.
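A one-sided bound uses the same ingredients as the two-sided interval, with a one-tailed critical value. The sketch below assumes the large-sample form and the table value 1.28 for 90% confidence.

```python
import math

n, xbar, s = 48, 17.17, 3.28   # shear strength summary (N/mm^2)
z_one_sided = 1.28             # z_{0.10}, for a 90% one-sided bound

lower_bound = xbar - z_one_sided * s / math.sqrt(n)
print(f"90% lower confidence bound: {lower_bound:.2f} N/mm^2")
```

We would then say we are 90% confident that the population mean shear strength is at least this value.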
For our normal-based confidence interval

and level to be valid, we must know (or at
least assume) that:
The sample is a random draw from the

population.
The sample size n is large enough that the
sample mean is approximately normally
distributed and that s is a good estimate of $\sigma$.
Chapter 6: Hypothesis Testing
Estimation (both point and interval) is

useful for providing an idea of the value of
a population parameter.
Frequently, we may wish to investigate a

more specific question about a parameter.
For this purpose, we use the other major
branch of inferential statistics, hypothesis
testing.
One-Sample Z-Tests (6.1-6.2)
Example: (Milk data) Suppose our bottle-filling machine is supposed to dispense
2.04 L of milk. Recall, a sample of size 50
gave $\bar{x} = 2.0727$, $s = 0.0711$. Does the
machine need to be recalibrated?
To answer this, let's assume that the

machine is working properly, and see how
likely we are to get a sample mean as far
or further from the expected value as the
sample mean we actually saw (2.0727).
More formally, we choose a null

hypothesis, H0.
This is a statement about a population
parameter (say, $\mu$), generally that it is
equal to the value of interest (denoted $\mu_0$).
Usually, the null hypothesis means
everything is as it should be, or nothing
interesting is happening.
Here:
H0: $\mu = 2.04$ ($= \mu_0$)
We also choose an alternative hypothesis,
H1, that the null is incorrect.
H1: $\mu \ne 2.04$
The alternative is literally simply that the
null is incorrect, but this is often the more
interesting or important result.
Next, we compute a test statistic, under
the assumption that H0 is correct.
For large-sample tests on the population
mean, $\mu$, we usually use the z-statistic:
$z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
Here: z = ?
If H0 is true, $\bar{x}$ is approximately $N(\mu_0, \sigma^2/n)$
and $z \sim N(0, 1)$.
Is z a typical value from a N(0, 1)
distribution?
Formally, we find a P-value, the probability
that a sample from the null distribution
would give a test statistic at least as
unusual as the one we just saw.
Since H1: $\mu \ne 2.04$, we use a two-sided
P-value: $P = P(|z| \ge 3.25)$ ($z \sim N(0,1)$).
From our table, if $z \sim N(0,1)$,
$P(|z| \ge 3.25) = .0012$.
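The milk test statistic and its two-sided P-value can be reproduced in a few lines. This is a sketch that uses s in place of the unknown population standard deviation, as the large-sample test allows.

```python
import math
from statistics import NormalDist

n, xbar, s = 50, 2.0727, 0.0711
mu0 = 2.04  # value of the mean under H0

# Standardize the sample mean under H0.
z = (xbar - mu0) / (s / math.sqrt(n))

# Two-sided P-value: probability of |z*| >= |z| for z* ~ N(0, 1).
p_two_sided = 2 * NormalDist().cdf(-abs(z))

print(f"z = {z:.2f}, two-sided P = {p_two_sided:.4f}")
```

The software value differs slightly in the last digit from the table value because the table lookup rounds z to 3.25 first.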
So we have two possibilities:
1) H0 is correct, $\mu = 2.04$, and we got very
unlucky to happen to get the (roughly) 1 in
800 chance to get $\bar{x} \ge 2.0727$ (or the
equally unusual $\bar{x} \le 2.0073$), or
2) H0 is wrong.
Which seems more reasonable to believe?
Since P is so small, we reject H0 and

decide the filling machine does require
recalibration.
All hypothesis tests follow this general
pattern:
1) We observe some difference in a sample and
wish to decide if it reflects a true difference in
the population.
2) Identify the null and alternative hypotheses.
3) Compute a test statistic which has a known
distribution when the null hypothesis is true.
4) Find a P-value: the probability of a statistic at
least as unusual as the one we observed,
when the null hypothesis is true.
5) If P is small, reject the null hypothesis.
Otherwise, fail to reject it.
This basic pattern holds for many different

tests on different parameters with different
assumptions.
For questions about the population mean

for a single population, we often use the
one-sample z-test demonstrated above.
Details on the one-sample z-test:
1) We have a single population, and a
specific value, $\mu_0$, we wish to consider for
the population mean.
This may be a known population mean for
some related population (see next example).
Or it may be a desired population mean
(example: milk data).
A sample from the population will give a
sample mean different from $\mu_0$, even if that is
the actual population mean.
2) Identify H0 and H1.
H1 is a statement that something interesting is
going on. It is usually what we wish to prove.
We should decide if we care about a one-sided or two-sided alternative, ideally before
we ever see data.
Two-sided: H0: $\mu = \mu_0$ vs. H1: $\mu \ne \mu_0$.
One-sided: H0: $\mu \le \mu_0$ vs. H1: $\mu > \mu_0$
or: H0: $\mu \ge \mu_0$ vs. H1: $\mu < \mu_0$
We always compute z and P using $\mu_0$, so
$\mu = \mu_0$ is always part of H0.
Example: A newspaper article

says that college freshmen average 7.5
hours per week at parties.
We suspect the number is lower at our

college.
H0 = ?
H1 = ?
3) Compute the test statistic.
For a one-sample z-test, use the z statistic
with $\mu_0$:
$z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
If $\sigma$ is unknown, use s instead.
Example (cont): Interview 100 freshmen.

The average reported time spent at parties
is 6.6 hours, and the standard deviation is
9 hours.
z=?
4) Find the P-value.
Under H0, $z \sim N(0,1)$. Use probabilities on
$z^* \sim N(0,1)$, depending on H1:
H1                  P
$\mu \ne \mu_0$     $P(|z^*| \ge |z|) = 2P(z^* \le -|z|)$
$\mu > \mu_0$       $P(z^* \ge z) = P(z^* \le -z)$
$\mu < \mu_0$       $P(z^* \le z)$
This gives the probability that, if H0 is true, a
new sample would give a statistic $\bar{x}^*$ which
disagrees with H0 at least as much as the
statistic $\bar{x}$ we have.
Example (cont): Interview 100 freshmen.
$\bar{x} = 6.6$ hours, $s = 9$ hours.
z = ?
P = ?
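For the party-hours example, the suspicion is that the mean is lower than 7.5, so the sketch below assumes the lower-tailed alternative H1: $\mu < 7.5$ and uses s in place of $\sigma$.

```python
import math
from statistics import NormalDist

n, xbar, s = 100, 6.6, 9.0
mu0 = 7.5  # newspaper's claimed mean

z = (xbar - mu0) / (s / math.sqrt(n))

# Lower-tailed alternative: P = P(z* <= z) for z* ~ N(0, 1).
p_lower = NormalDist().cdf(z)

print(f"z = {z:.2f}, one-sided P = {p_lower:.4f}")
```

A sample mean one standard error below 7.5 is not unusual under H0, which is reflected in a P-value of roughly 0.16.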
5) Reject H0 for small P.
Choose a small significance level, $\alpha$.
Values of 0.05 or 0.01 are most commonly
used.
If $P \le \alpha$, the evidence is pretty strong against
H0, and we say we reject H0 (at the $\alpha$ level).
We have strong evidence of H1.
If $P > \alpha$, our test statistic is pretty reasonable
under H0, so we say we fail to reject H0.
Note: a large P-value is not proof of H0; many
other hypotheses may also be reasonable.
This is why we do not say that we accept H0.
Many people call any result with P < 0.05

statistically significant, and any with P < 0.01
highly (statistically) significant.
Note that this is very artificial. A P-value of
0.049 is only slightly stronger than one of
0.051, yet we treat them very differently.
We should always report the P-value, to
provide full information.
You should always explain in words what your
conclusion implies for the situation.
Example (Student data, cont): What does

our P-value suggest about our
hypotheses?
What does this say about the party habits

of freshmen at our university?
If P = 0.16, is it correct to say
$P(\mu \ge 7.5) = .16$?
In general, with P-value P, can we say

P(H0 is true) = P?
No! Nothing in that statement is random.
P is a conditional probability given H0 is

true, not a probability on H0 itself.
Example: A machine that produces metal

cylinders is set to make cylinders with
diameter 50mm. A random sample of 60
cylinders has $\bar{x} = 49.9865$ and $s = 0.0524$.
Is the machine calibrated correctly?
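The cylinder question follows the same five steps. The sketch below assumes a two-sided alternative, since a miscalibrated machine could err in either direction.

```python
import math
from statistics import NormalDist

n, xbar, s = 60, 49.9865, 0.0524
mu0 = 50.0  # target diameter in mm

z = (xbar - mu0) / (s / math.sqrt(n))
p_two_sided = 2 * NormalDist().cdf(-abs(z))

print(f"z = {z:.3f}, two-sided P = {p_two_sided:.4f}")
```

The P-value falls just under 0.05, so the result is statistically significant at the 5% level, though only barely.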
Note that practical significance is not the

same as statistical significance.
Even though we found statistical

significance for our machine's calibration,
it may be that the difference between
50mm and 49.9865mm is too small to
justify the expense of recalibration.
Large samples are particularly prone to

indicate statistical significance despite a
difference too small to be important.
Conversely, small samples may come with

large standard errors, so that a difference
which might be very important if confirmed
cannot be shown to be statistically
significant.
We should supplement our test with a

confidence interval, which will do a much
better job of indicating the size and
therefore importance of a potential
difference.
Example: Machine: $n = 60$, $\bar{x} = 49.9865$,
and $s = 0.0524$.
What is a 95% confidence interval for $\mu$?
Note that a $100(1 - \alpha)\%$ confidence
interval for $\mu$ will include (exclude) $\mu_0$
exactly whenever a two-sided test of
H0: $\mu = \mu_0$ fails to reject (rejects) at the $\alpha$
level.
Similarly, a $100(1 - \alpha)\%$ lower bound will
fall below (above) $\mu_0$ exactly whenever a
one-sided test of H0: $\mu \le \mu_0$ fails to reject
(rejects) at the $\alpha$ level.
Likewise for upper bounds and tests on
H0: $\mu \ge \mu_0$.
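The correspondence between intervals and tests can be seen numerically with the machine data. In this sketch the helper `z_interval_and_test` is a name of our own choosing, and the null values other than 50 are hypothetical, picked only to show both outcomes.

```python
import math
from statistics import NormalDist

def z_interval_and_test(xbar, s, n, mu0, alpha=0.05):
    """Return (ci, p_value) for the large-sample two-sided z procedures."""
    se = s / math.sqrt(n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (xbar - z_crit * se, xbar + z_crit * se)
    p_value = 2 * NormalDist().cdf(-abs((xbar - mu0) / se))
    return ci, p_value

alpha = 0.05
for mu0 in (50.0, 49.99, 49.98, 49.97):  # 50.0 is the slide's value; the rest are hypothetical
    ci, p = z_interval_and_test(49.9865, 0.0524, 60, mu0, alpha)
    excluded = not (ci[0] <= mu0 <= ci[1])
    rejected = p <= alpha
    print(f"mu0 = {mu0}: excluded from CI = {excluded}, test rejects = {rejected}")
```

In every row the two columns agree, and the interval itself shows that the plausible values of $\mu$ all lie within a few hundredths of a millimetre of 50.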
Small-Sample (t) Intervals (5.4)
Strictly speaking, the z-tests and confidence
intervals for means we looked at in Sections
5.2 and 6.1 require that we know the
standard deviation, $\sigma$, for our population.
In practice, for large samples, s is a good
enough estimate for $\sigma$ that we can use s
without harming our P-values or interval
coverage severely.
If n is small, s may be far off from $\sigma$, and we
require an adjustment to our intervals that
takes into account this uncertainty.
The t-statistic
Since in practice, both $\mu$ and $\sigma$ are usually
unknown, when n is small (n < 30) we
often use the following result:
If $X_1, \ldots, X_n \sim \text{Normal}(\mu, \sigma^2)$, then
$t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$
has a t distribution with n-1 degrees of
freedom.
The t distribution is a bell curve with

heavier tails than the normal distribution.
It is always centered on, and symmetric

around, 0.
The $t_1$ curve is most spread out (has the
heaviest tails). As $\nu$ (the degrees of freedom) gets larger, the tails
get lighter and the curve gets less spread
out.
As $\nu \to \infty$, the t distribution approaches the
standard normal distribution.
Table A.3 contains important percentiles
(critical values).
Each row represents a different t
distribution.
Example: $T \sim t_{12}$
$P(T \ge 2.681) = 0.01$
$P(T < -1.356) = 0.10$
Example: $T \sim t_9$
$P(T \ge 1.833) = ?$
$P(-2.262 < T < 2.262) = ?$
Find c such that $P(T > c) = 0.001$

We can use a calculation similar to the
one in Section 5.2 to justify a t-based
confidence interval when n < 30.
Let $\bar{x}$ and s be the sample mean and
standard deviation from a sample of size n
from a normal population or process.
Then a $100(1-\alpha)\%$ confidence interval for the
population mean has the form
$\bar{x} \pm t_{n-1,\alpha/2}\, \dfrac{s}{\sqrt{n}}$
Note this is the same form as for the usual
z-interval. The only difference is the
replacement of the usual normal (z) critical
value (such as 1.96) with one found on
the t-table with (n-1) degrees of freedom.
One-sided intervals and bounds may be
found by taking only the appropriate limit
(+ or -) and choosing the one-sided t
critical value ($t_{n-1,\alpha}$).
Example: An experiment measuring the tire

life of a new rubber compound finds the
mileage to end-of-life. A sample of size 10
finds a mean of 61,492 miles and a standard
deviation of 3,035 miles. A normal model is
appropriate.
Find a 95% confidence interval for the

population mean tire life.
If we only care about a minimum, find a 95%

lower bound for population mean tire life.
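The tire-life questions can be worked with critical values from the row of the t-table for 9 degrees of freedom. This sketch hard-codes those table values (2.262 for the two-sided 95% interval, 1.833 for the one-sided 95% bound) rather than computing them.

```python
import math

n, xbar, s = 10, 61492.0, 3035.0   # tire life summary (miles)
t_two_sided = 2.262  # t_{9, 0.025} from the t-table
t_one_sided = 1.833  # t_{9, 0.05} from the t-table

se = s / math.sqrt(n)
ci = (xbar - t_two_sided * se, xbar + t_two_sided * se)
lower_bound = xbar - t_one_sided * se

print(f"95% CI: ({ci[0]:.0f}, {ci[1]:.0f}) miles")
print(f"95% lower bound: {lower_bound:.0f} miles")
```

The lower bound sits above the lower end of the two-sided interval because all of the 5% error probability is placed in one tail.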
A brand of margarine was analyzed to

determine the level of polyunsaturated fatty
acid. In 6 samples, the mean percent is
16.98%, and the s.d. is 0.32%. A normal
distribution is reasonable for this variable.
Find a 99% confidence interval for population

mean percent pfa.
Find a 90% upper bound for population mean

percent pfa.
t-Tests (6.4)
We should also use the t distribution to

conduct hypothesis tests when n is small,
and $\sigma$ is unknown.
This is especially important when n < 30,

but can be used for any sample size.
We will continue to require the assumption

of a normal population.
The process of conducting a t-test is
identical to conducting a z-test, except for
step 4, computation of the P-value.
1) We have a single population, and a
specific value, $\mu_0$, we wish to consider for
the population mean.
2) Identify H0 and H1 just as for a z-test.
3) Compute the test statistic
$t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$
4) Find the P-value.
Under H0, $t \sim t_{n-1}$. Use probabilities on $t^* \sim t_{n-1}$, depending
on H1:
H1                  P
$\mu \ne \mu_0$     $P(|t^*| \ge |t|) = 2P(t^* \ge |t|)$
$\mu > \mu_0$       $P(t^* \ge t)$
$\mu < \mu_0$       $P(t^* \le t)$
Tables (such as Table A.3) can give a range for P. Minitab
and other software can make your P-values exact.
P still gives the probability that, if H0 is true, a new sample
would give an $\bar{x}^*$ which is at least as unusual as the $\bar{x}$ we
have.
5) Reject H0 for small P, just as for a z-test.
Example: A car manufacturer claims a

model gets 35 mpg. A consumer group
wishes to test this claim. We measure 14
cars, find $\bar{x} = 34.271$ mpg, $s = 2.915$ mpg.
H0?
H1?
t=?
d.f. = ?
P=?
Conclusion?
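Software gives exact t-based P-values where the table only gives a range. The sketch below avoids third-party libraries by integrating the t density numerically with Simpson's rule; `t_pdf` and `t_cdf` are our own helper names. It assumes the lower-tailed alternative H1: $\mu < 35$ for the mileage claim, a choice the slide leaves as an exercise.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x), by Simpson's rule on [0, |x|] plus symmetry about 0."""
    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # P(0 <= T <= |x|)
    return 0.5 + area if x >= 0 else 0.5 - area

n, xbar, s, mu0 = 14, 34.271, 2.915, 35.0
df = n - 1
t = (xbar - mu0) / (s / math.sqrt(n))
p_lower = t_cdf(t, df)  # lower-tailed P-value (assumed alternative)

print(f"t = {t:.3f}, d.f. = {df}, one-sided P = {p_lower:.3f}")
```

With a P-value well above 0.05, this sample would not give strong evidence against the manufacturer's claim.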
Example Minitab output: One-sample
tests, intervals
Results for: cholesterol.txt

One-Sample T: Cholesterol in mg/dL
Test of mu = 175 vs not = 175
Variable           N     Mean   StDev  SE Mean  95% CI                 T      P
Cholesterol in m  20  205.800  48.392   10.821  (183.152, 228.448)  2.85  0.010
Type I and Type II Errors (6.6)
We can make two different kinds of error
when hypothesis testing, depending on
our decision and the actual (unknown)
truth:
                                  Truth
Decision             H0 True            H1 True
Reject H0            Type I Error       Correct Decision
Fail to Reject H0    Correct Decision   Type II Error
Type I Error is generally considered more

serious. This may influence the choice of
H0 and H1.
Our significance level, $\alpha$, is the probability
of Type I error we are willing to accept
(when the null hypothesis is true).
Note that this does not control Type II

error at all. The probability of Type II error
may be very large, especially for small n.
It might help to think about the possible
consequences of the different errors, by
considering the (usually nonstatistical)
example of a jury trial.
                                             Truth
Decision                      H0 True                 H1 True
                              (Defendant Innocent)    (Defendant Guilty)
Reject H0 (Convict)           Type I Error            Correct Decision
Fail to Reject H0 (Acquit)    Correct Decision        Type II Error
Power (6.7)
Definition: The power of a significance test
with significance level $\alpha$ is
power = P(reject H0 | H0 false)
= 1 - P(Type II Error | H0 false).
If there is more than one value of $\mu$
associated with H1, power will generally be
computed for a specific value of $\mu$.
Large power is good, and power will

increase as the sample size increases.
There is no firm rule about power, but it is

desirable to have a power of at least 0.8 or
0.9 for a difference which is big enough to
be important.
Power should be computed prior to

conducting an experiment whenever
possible, to verify that the experiment will
probably show results if the difference you
desire or anticipate exists.
To compute power, we must:
1. Decide on a significance level, $\alpha$.
2. Compute the rejection region, the set of
possible values of $\bar{x}$ which would lead to
rejecting H0.
3. Compute the probability of finding $\bar{x}$ in the
rejection region, given the specified value of
$\mu$.
See your book for a full example.
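The three steps above can be sketched for a lower-tailed z-test. The setup borrows the freshman party example (null value 7.5, standard deviation 9, n = 100); the alternative mean of 5.5 hours is a hypothetical value chosen for illustration, and the function name is ours.

```python
import math
from statistics import NormalDist

def power_lower_tailed_z(mu0, mu1, sigma, n, alpha=0.05):
    """Power of the test of H0: mu >= mu0 vs. H1: mu < mu0 when the true mean is mu1."""
    se = sigma / math.sqrt(n)
    # Step 2: rejection region is xbar <= cutoff.
    cutoff = mu0 - NormalDist().inv_cdf(1 - alpha) * se
    # Step 3: probability that xbar lands in that region when mu = mu1.
    return NormalDist(mu1, se).cdf(cutoff)

power = power_lower_tailed_z(mu0=7.5, mu1=5.5, sigma=9.0, n=100)
print(f"Power against mu = 5.5: {power:.3f}")
```

Under these assumptions the power is a little over 0.7, below the 0.8 to 0.9 usually desired, so a larger sample would be needed to detect a two-hour difference reliably.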
Statistical software such as Minitab can compute the
power for a specified test, or the sample size necessary
to achieve a given power.
Power and Sample Size

1-Sample t Test
Testing mean = null (versus > null)
Calculating power for mean = null + difference
Alpha = 0.05
Assumed standard deviation = 1.5
            Sample
Difference    Size     Power
                    0.786845
Multiple Testing (6.8)
Even if the null hypothesis is true, there is

a 5% chance that the result of a test will
have a p-value of less than 0.05.
If we carry out 200 tests, we expect 5%, or

10, to be significant, even if the null is true
every time.
This can be useful in exploration, but the

significant results should be reconfirmed
with an additional study with new data.
Example: Researchers looked at

asbestos fibers in water and many cancer
rates. They did many tests, only a few of
which gave p-values less than 0.05, and
only one of which gave less than 0.01.
They concluded there was a strong

relationship between lung cancer rates
and asbestos concentration, even though
their own study suggested that a 100-fold
increase in asbestos was accompanied by
a 5% increase in lung cancer rate.
If we have multiple tests, we can adjust for

this with the Bonferroni correction.
Adjusted P = (Original P) $\times$ (Number of tests)
The Bonferroni adjustment is conservative,

so if the adjusted P is small, we may still
reject H0.
Example: Test 5 new fertilizer formulations

for mean yields higher than current standard
formulation.
Formulation A: P = 0.49
Formulation B: P = 0.24
Formulation C: P = 0.17
Formulation D: P = 0.003
Formulation E: P = 0.53
Adjust Formulation D's P-value: PB = 5(.003)
= .015.
Still small, reject H0. D appears to be an

improvement.
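The Bonferroni adjustment for the five fertilizer tests is a one-line calculation. The sketch below also caps adjusted values at 1, a common convention that the slide does not mention, and uses a 0.05 cutoff as an assumed significance level.

```python
# P-values for the five fertilizer formulations.
p_values = {"A": 0.49, "B": 0.24, "C": 0.17, "D": 0.003, "E": 0.53}

m = len(p_values)  # number of tests carried out
# Multiply each P-value by the number of tests; cap at 1 (assumed convention).
adjusted = {name: min(1.0, m * p) for name, p in p_values.items()}

alpha = 0.05  # assumed significance level
significant = [name for name, p in adjusted.items() if p <= alpha]

for name, p in adjusted.items():
    print(f"Formulation {name}: adjusted P = {p:.3f}")
print("Still significant after adjustment:", significant)
```

Only Formulation D survives the adjustment, matching the hand calculation.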
Example: As before, but for D, P = .022.
Now PB = 5(0.022) = 0.11.
Not small enough to be conclusive. Fail to

reject H0. D might have been higher by
chance.
If the confidence interval suggests

practical significance, rerun the
experiment for D, and collect new data.
If P is small with new data, results are

convincing.
Chapter 7: Two-Sample Inference
In many situations, we may not care about

the specific value of the mean of a
population, so much as comparing the
means of two separate, but related,
populations.
We use two-sample inference to

investigate these sorts of questions.
Example: Sexual discrimination?
$\mu_M$ = population mean of male salaries
$\mu_F$ = population mean of female salaries
Test H0: $\mu_M \le \mu_F$ vs. H1: $\mu_M > \mu_F$
Example: New fertilizer for corn
$\mu_1$ = average yield for the new treatment
$\mu_0$ = average yield for a common current
treatment
Test H0: $\mu_1 \le \mu_0$ vs. H1: $\mu_1 > \mu_0$
Confidence interval on ($\mu_1 - \mu_0$)
Suppose we have two populations or
processes.
Population 1 (X) has mean $\mu_X$ and
standard deviation $\sigma_X$.
Likewise, population 2 (Y) has mean $\mu_Y$
and standard deviation $\sigma_Y$.
We will compare the individual means by
looking at the difference, $\mu_X - \mu_Y$.
We usually do this by collecting
independent samples from each
population (of sizes $n_X$ and $n_Y$, which may
or may not be the same) and computing
the sample means ($\bar{x}$ and $\bar{y}$) and
standard deviations ($s_X$ and $s_Y$).
We will estimate the difference of
population means, $\mu_X - \mu_Y$, by the
difference of the sample means, $\bar{x} - \bar{y}$.
To do inference, we need to know about
the sampling distribution of $\bar{x} - \bar{y}$.
Recall the following results from Sections
3.4 and 4.3:
1) For any X and Y, $\mu_{X-Y} = \mu_X - \mu_Y$.
The mean of the difference is the
difference of the means.
2) For any independent X and Y,
$\sigma^2_{X-Y} = \sigma^2_X + \sigma^2_Y$.
The variance of the difference is the sum
of the variances.
3) If X and Y are (approximately) normal, so
is the difference.
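Those results, together with what we
already know about the sampling
distributions of $\bar{x}$ and $\bar{y}$, give us:
1) $\bar{x} - \bar{y}$ is an unbiased estimator of $\mu_X - \mu_Y$.
Show:
2) The standard error of $\bar{x} - \bar{y}$ is
$\sigma_{\bar{x}-\bar{y}} = \sqrt{\dfrac{\sigma_X^2}{n_X} + \dfrac{\sigma_Y^2}{n_Y}}$
Show: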
3) If both populations are normal, so is the
sampling distribution of $\bar{x} - \bar{y}$.
4) If both sample sizes are large, the Central
Limit Theorem tells us that the sampling
distribution of $\bar{x} - \bar{y}$ will be approximately
normal no matter what shapes the
population distributions have.
5) When standardizing using sample
standard deviations from small samples
from normal populations, we should
continue to use a t-distribution.
Two-Sample z-Tests (7.1)
If we have two independent samples, we

construct a two-sample test in much the
same way as the one-sample version.
If both samples are large, we may use the

normal distribution as we do in the single-sample case.
Remember our five steps of hypothesis

testing.
1) We have two populations whose means we
wish to compare, and a sample from each
population.
Population 1 (X) has mean $\mu_X$ and standard
deviation $\sigma_X$.
Likewise, population 2 (Y) has mean $\mu_Y$ and
standard deviation $\sigma_Y$.
The sample means $\bar{x}$ and $\bar{y}$ will be different,
even if the population means $\mu_X$ and $\mu_Y$ are the
same.
Are $\bar{x}$ and $\bar{y}$ different enough to provide strong
evidence that $\mu_X$ and $\mu_Y$ are different as well?
Example: A psychological test (the

Chapin Social Insight Test) is given to a
large number of college students, with a
desire to see if there is a difference in how
men and women score.
Does a test suggest that the populations of

college men and women have different
means on this test?
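2) Identify H0 and H1. We generally rearrange
our hypotheses to describe a statement
about a difference in the population means.
Two-sided:
H0: $\mu_X - \mu_Y = \Delta_0$ vs. H1: $\mu_X - \mu_Y \ne \Delta_0$.
One-sided:
H0: $\mu_X - \mu_Y \le \Delta_0$ vs. H1: $\mu_X - \mu_Y > \Delta_0$.
or:
H0: $\mu_X - \mu_Y \ge \Delta_0$ vs. H1: $\mu_X - \mu_Y < \Delta_0$.
Usually, $\Delta_0 = 0$, so that $\mu_X = \mu_Y$ is part of H0.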

Example: (Students, continued):
We are interested in showing any difference
between the men's population mean ($\mu_X$) and
the women's population mean ($\mu_Y$), so use a
two-sided alternative.
H0: $\mu_X - \mu_Y = 0$ vs. H1: $\mu_X - \mu_Y \ne 0$.
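3) A z (or t) statistic always has the form:
(point estimate - null value) / (standard error)
In this case, our parameter is $\mu_X - \mu_Y$.
Our point estimate is $\bar{x} - \bar{y}$.
Our null value is $\Delta_0$.
The standard error of $\bar{x} - \bar{y}$
is $\sqrt{\sigma_X^2/n_X + \sigma_Y^2/n_Y}$, estimated by $\sqrt{s_X^2/n_X + s_Y^2/n_Y}$.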
Our z statistic is therefore
$z = \dfrac{(\bar{x} - \bar{y}) - \Delta_0}{\sqrt{s_X^2/n_X + s_Y^2/n_Y}}$
Example: Students:
z = ?
4) Find the P-value.
This is still a z-test. Under H0, $z \sim N(0,1)$. Use
probabilities on $z^* \sim N(0,1)$, depending on H1:
H1                              P
$\mu_X - \mu_Y \ne \Delta_0$    $P(|z^*| \ge |z|) = 2P(z^* \le -|z|)$
$\mu_X - \mu_Y > \Delta_0$      $P(z^* \ge z) = P(z^* \le -z)$
$\mu_X - \mu_Y < \Delta_0$      $P(z^* \le z)$
P is now the probability that, if H0 is true, a
new sample would give a difference which is
at least as unusual as the one we have.
Example: z = 0.65. P = ?
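The two-sample z statistic and its P-value can be wrapped in two small functions. In this sketch the function names and the `alternative` labels are our own, and the summary numbers passed to `two_sample_z` are hypothetical, since the students' summary statistics are not reproduced here.

```python
import math
from statistics import NormalDist

def two_sample_z(xbar, sx, nx, ybar, sy, ny, delta0=0.0):
    """z statistic for H0: mu_X - mu_Y = delta0, using large-sample standard error."""
    se = math.sqrt(sx ** 2 / nx + sy ** 2 / ny)
    return (xbar - ybar - delta0) / se

def z_p_value(z, alternative="two-sided"):
    """P-value for a z statistic; alternative is 'two-sided', 'greater' or 'less'."""
    std_normal = NormalDist()
    if alternative == "two-sided":
        return 2 * std_normal.cdf(-abs(z))
    if alternative == "greater":
        return std_normal.cdf(-z)
    if alternative == "less":
        return std_normal.cdf(z)
    raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")

# The students example reports z = 0.65 with a two-sided alternative.
p_students = z_p_value(0.65)
print(f"Two-sided P for z = 0.65: {p_students:.4f}")

# Hypothetical summary statistics, only to show the z calculation itself.
z_demo = two_sample_z(xbar=10.0, sx=2.0, nx=100, ybar=9.0, sy=2.0, ny=100)
print(f"Demo z = {z_demo:.3f}")
```

A P-value above one half means a difference this large or larger would arise in most samples even if the two population means were equal.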
5) The interpretation of P is exactly the same as
for any other hypothesis test.
If $P \le \alpha$, the evidence is pretty strong against
H0, and we say we reject H0 (at the $\alpha$ level).
We have strong evidence of H1.
If $P > \alpha$, our test statistic is pretty reasonable
under H0, so we fail to reject H0. H0 is
plausible (although probably so is H1).
Example: What does our P say about the

populations of student men and women?
Example: It is claimed that (on average)

tensile strength should be at least 8
N/mm$^2$ greater for 12mm-diameter steel
rods than for 10mm-diameter rods.
Samples of size 50 give:
Can we confirm the claim?
Two-Sample Confidence Intervals
Instead of, or in addition to, a two-sample
test, we may desire a confidence interval
for the difference of the population means,
$\mu_X - \mu_Y$.
The same results that allow us to conduct
a test also allow computation of this
interval.
A $100(1-\alpha)\%$ confidence interval for
$\mu_X - \mu_Y$ will be
$(\bar{x} - \bar{y}) \pm z_{\alpha/2} \sqrt{\dfrac{s_X^2}{n_X} + \dfrac{s_Y^2}{n_Y}}$
Interpretation of the confidence level is

exactly the same as in the one-sample
case.
Example: Students:
Find a 95% confidence interval for the

difference in population means between
men and women.
Example: Steel rods:
Find a 99% lower bound on the difference

in population means.
Two-Sample t-Tests and Intervals (7.3)
Just as in the one-sample case, if at least

one of the sample sizes is small, we run
into the same dangers for estimating the
standard error from the sample as we do
in the single-sample case.
We again use a t-table to compensate.
A two-sample t-test is conducted exactly

as a two-sample z-test, but uses t
probabilities for P-values.
We find P from the t-table or a computer

package such as Minitab, just as with the
one-sample t-test.
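The degrees of freedom to be used can be
calculated as
$\nu = \dfrac{\left( s_X^2/n_X + s_Y^2/n_Y \right)^2}{\dfrac{(s_X^2/n_X)^2}{n_X - 1} + \dfrac{(s_Y^2/n_Y)^2}{n_Y - 1}}$, rounded down to the nearest integer.
Confidence intervals use critical values $t_{\nu,\alpha/2}$.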

Example: Kudzu pulp yield:
Researchers are investigating using kudzu as

an alternative to wood pulp in paper
production.
They wish to determine if adding the chemical
anthraquinone increases the pulp yield.
Treatment: 25 experiments with anthraquinone
Control: 20 experiments without anthraquinone
Is the anthraquinone increasing the
population mean yield? Conduct a two-sample t-test.
Construct a 95% confidence interval for

mean improvement.
Inference with Paired Data (7.4)
Sometimes, we can improve an estimate

of population differences by arranging to
collect the data in a paired fashion.
Each observation from population 1 (X)

should be paired with an observation from
population 2 (Y).
This will be effective if the pairing is such

that the pairs tend to be correlated.
Example: Heart rate data
Compare two drugs for effect on heart rate

reduction.
For samples of size 20, we could get 40
volunteers and divide them at random into two
groups to get independent samples.
If drug response varies substantially from
subject to subject, it may be better to give
both drugs to each subject (on different
occasions, in random order).
This reduces the effect of subject variability,
and is probably cheaper and easier as well!
Other examples:
Test tire wear for two brands by putting

Brand A on left front wheel and Brand B
on right front wheel (or vice-versa, at
random) on the same cars.
Compare airplane deicing procedures by

using one method on left wing, other on
right wing of the same planes.
Dealing with paired data (Xi, Yi) is actually

simpler than dealing with two independent
samples.
For each observation, compute the
difference $D_i = X_i - Y_i$, and then conduct a
one-sample z- or t-test or construct a one-sample z- or t-confidence interval on the
differences $D_i$, depending on the number
of pairs.
This will give us a test or interval for
$\mu_D = \mu_X - \mu_Y$.
Example: (Heart rate data, continued):
Let $X_i$ be the percent rate reduction from the
standard drug for subject i, and let $Y_i$ be the
same for the new drug. Let $D_i = X_i - Y_i$.
Data:
Patient     Xi      Yi      Di
1         28.5    34.8    -6.3
2         26.6    37.3   -10.7
:            :       :       :
20        40.1    40.8    -0.7
Example (Heart rate, continued):
Is there enough evidence to conclude that

the new drug is more effective at reducing
heart rate than the old one on average?
Construct a 95% paired t-interval for
$\mu_d = \mu_1 - \mu_2$.
Example Minitab output:
Two-sample tests, intervals
Results for: kudzu.txt

Two-Sample T-Test and CI: With, Without
Two-sample T for With vs Without
           N   Mean  StDev  SE Mean
With      25  44.18   3.99     0.80
Without   20  38.56   3.63     0.81
Difference = mu (With) - mu (Without)
Estimate for difference: 5.62500
95% lower bound for difference: 3.71000
T-Test of difference = 0 (vs >): T-Value = 4.94  P-Value = 0.000  DF = 42
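The T-Value and DF in this output can be reproduced from the printed summary statistics. The sketch below assumes the unpooled (Welch) standard error and degrees-of-freedom formula, rounded down; `welch_t` is our own helper name. Small differences from Minitab come from the rounded means and standard deviations.

```python
import math

def welch_t(xbar, sx, nx, ybar, sy, ny, delta0=0.0):
    """Return (t, df) for a two-sample t-test without assuming equal variances."""
    vx, vy = sx ** 2 / nx, sy ** 2 / ny   # squared standard errors of each mean
    t = (xbar - ybar - delta0) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, math.floor(df)              # degrees of freedom rounded down

# Summary statistics as printed in the kudzu output (With vs. Without).
t_stat, df = welch_t(44.18, 3.99, 25, 38.56, 3.63, 20)
print(f"t = {t_stat:.2f}, d.f. = {df}")
```

Both values agree with the output above, which suggests the software used this unpooled form.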
Results for: heartrate.txt

Paired T-Test and CI: StdDrug, NewDrug
Paired T for StdDrug - NewDrug
              N      Mean    StDev  SE Mean
StdDrug      40   31.1825   4.8318   0.7640
NewDrug      40   33.8375   4.9379   0.7808
Difference   40  -2.65500  3.73012  0.58978
99% CI for mean difference: (-4.25208, -1.05792)
T-Test of mean difference = 0 (vs not = 0): T-Value = -4.50  P-Value = 0.000
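A paired analysis is a one-sample analysis of the differences, which the sketch below illustrates using the Difference row of the output. The critical value 2.708 for 39 degrees of freedom is taken as a table value (an assumption, since that row is not quoted in these notes), and the three pairs shown are just the rows visible in the data table.

```python
import math

# Differences are formed pair by pair (three visible rows of the data table).
pairs = [(28.5, 34.8), (26.6, 37.3), (40.1, 40.8)]
diffs = [x - y for x, y in pairs]

# Summary of all 40 differences, from the "Difference" row of the output.
n, dbar, sd = 40, -2.65500, 3.73012
se = sd / math.sqrt(n)
t = dbar / se                      # test of H0: mu_D = 0

t_crit = 2.708                     # t_{39, 0.005}, assumed table value for 99% confidence
ci = (dbar - t_crit * se, dbar + t_crit * se)

print("First differences:", [round(d, 1) for d in diffs])
print(f"SE = {se:.5f}, t = {t:.2f}")
print(f"99% CI for the mean difference: ({ci[0]:.3f}, {ci[1]:.3f})")
```

Note that the standard error of the mean difference is smaller than either drug's own standard error, which is the benefit of pairing when the pairs are positively correlated.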
Chapter 9: Analysis of Variance
In Chapter 7, we looked at comparing the

means of samples from two populations.
Suppose we have samples from more

than two populations, and we wish to test
whether all of the populations have the
same mean.
We use an extension of two-sample tests

called Analysis of Variance (ANOVA).
ANOVA generates P-values using another

important distribution, the F distribution.
The F distribution (7.5)
The F distribution is another distribution

commonly used in hypothesis tests.
It is a skewed distribution with support on the

positive real line.
It has two degrees of freedom parameters, $\nu_1$
and $\nu_2$.
We write $X \sim F_{\nu_1,\nu_2}$ when X has an F distribution with $\nu_1$ and
$\nu_2$ degrees of freedom.
Table A.6 contains critical values similar to those

for the t distributions.
With a separate table needed for each
combination of $\nu_1$ and $\nu_2$, we only get critical
points for a few values of $\alpha$.
We have tables for $\alpha$ = 0.10, 0.05, 0.01, and
0.001.
F tests are generally upper-tailed, so these

generally give us what we need.
Minitab and other packages can give us more

precise P-values.
Example: $x \sim F_{5,7}$
For what c is $P(x \ge c) = .10$?
For what c is $P(x \ge c) = .01$?
If an F statistic is 10 (with degrees of freedom
5 and 7), what can we say about an upper-tailed P-value?
Single Factor Analysis of Variance (9.1)
Suppose we have I populations ($I \ge 3$).

The populations are often called levels.
The variable identifying the levels is called

a factor.
A factor variable may be categorical, but it

also may identify different treatment
groups in a controlled experiment.
From each population i we take an
independent sample of size $J_i$.
Note: all variances and standard
deviations are assumed to be equal.
The total sample size is
$N = J_1 + J_2 + \cdots + J_I$.
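We wish to test
H0: $\mu_1 = \mu_2 = \cdots = \mu_I$ vs.
H1: Two or more of the $\mu_i$ are different.
We begin by estimating the individual
population means as
$\bar{X}_{i.} = \dfrac{1}{J_i} \sum_{j=1}^{J_i} X_{ij}$
(where $X_{ij}$ is the jth observation from level i)
and the common, or grand, mean (if H0 is
true) as
$\bar{X}_{..} = \dfrac{1}{N} \sum_{i=1}^{I} \sum_{j=1}^{J_i} X_{ij}$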
Example: Artery data

Stenosis                       Level 1   Level 2   Level 3
Flowrate (ml/s) at collapse       10.6      11.7      19.6
                                   9.7      12.7      15.1
                                   8.3      17.6      16.6
Ji                                  11        14        10
Sample mean                     11.209    15.086    17.330
N = 11 + 14 + 10 = 35
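The variability between levels is estimated
by the treatment sum of squares,
$SSTr = \sum_{i=1}^{I} J_i (\bar{X}_{i.} - \bar{X}_{..})^2$
where $\bar{X}_{i.}$ is the sample mean of level i and $\bar{X}_{..}$ is the grand mean.
The variability within levels is estimated by
the error sum of squares,
$SSE = \sum_{i=1}^{I} \sum_{j=1}^{J_i} (X_{ij} - \bar{X}_{i.})^2 = \sum_{i=1}^{I} (J_i - 1) s_i^2$
where $s_i$ is the sample standard deviation of level i.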
The total variability of the dataset is found
as the total sum of squares,
$SST = \sum_{i=1}^{I} \sum_{j=1}^{J_i} (X_{ij} - \bar{X}_{..})^2 = (N - 1) s^2$
where $s^2$ is the sample variance of the full
dataset.
Note that
SST = SSTr + SSE.

Example (Artery data):
To test our hypotheses, we compare these
two measures of variability in an F-statistic
$F = \dfrac{SSTr/(I-1)}{SSE/(N-I)} = \dfrac{MSTr}{MSE}$
The numerator and denominator of this
statistic are called the mean square for
treatments and the mean square error,
respectively.
If the null hypothesis is true, then the
mean square for treatments and the mean
square error are both estimates of the
common variance, $\sigma^2$, and
$F \sim F_{I-1,\,N-I}$
Large differences between the level
means will lead to large values of F, so the
P-value is defined as
$P = P(X \ge F)$, where $X \sim F_{I-1,\,N-I}$
Example (Artery data):
F=?
P-value?
Conclusion?
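The F statistic for the artery data can be rebuilt from per-level summary statistics alone. This sketch uses the level sizes, means and standard deviations as reported for the three stenosis levels; because those are rounded, the sums of squares match software output only to about two decimals.

```python
# (J_i, sample mean, sample standard deviation) for each stenosis level.
levels = [
    (11, 11.209, 1.899),
    (14, 15.086, 2.150),
    (10, 17.330, 2.168),
]

I = len(levels)                         # number of levels
N = sum(J for J, _, _ in levels)        # total sample size
grand_mean = sum(J * m for J, m, _ in levels) / N

# Between-level and within-level sums of squares.
sstr = sum(J * (m - grand_mean) ** 2 for J, m, _ in levels)
sse = sum((J - 1) * sd ** 2 for J, _, sd in levels)

mstr = sstr / (I - 1)                   # mean square for treatments
mse = sse / (N - I)                     # mean square error
F = mstr / mse

print(f"SSTr = {sstr:.2f}, SSE = {sse:.2f}")
print(f"MSTr = {mstr:.2f}, MSE = {mse:.2f}, F = {F:.2f} on ({I - 1}, {N - I}) d.f.")
```

The P-value is then the upper-tail probability of this F value under the F distribution with 2 and 32 degrees of freedom, which Table A.6 can bound.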
Results for: arteries.txt

One-way ANOVA: Collapse Flowrate versus Amount of Stenosis
Source            DF      SS      MS      F      P
Amount of Stenos   2  204.02  102.01  23.57  0.000
Error             32  138.47    4.33
Total             34  342.49
S = 2.080   R-Sq = 59.57%   R-Sq(adj) = 57.04%

Level     N    Mean  StDev
level 1  11  11.209  1.899
level 2  14  15.086  2.150
level 3  10  17.330  2.168
[Plot: Individual 95% CIs For Mean Based on Pooled StDev, one interval per level, axis from 10.0 to 17.5]
Pooled StDev = 2.080
Pairwise Comparisons (9.2)
If an ANOVA test rejects H0, it suggests

that at least some of the level means are
different from one another, but does not
automatically identify which ones.
We can do two-sample tests at this point,

but doing many tests risks false positives
(Type I error; recall Section 6.8).
We could use a Bonferroni correction, but

this is unnecessarily conservative.
Instead, we should use the Tukey-Kramer

multiple comparisons procedure, which
adjusts for the number of tests.
Such a procedure will have probability $\alpha$ of
one or more Type I errors (false significant
differences) out of the full set.
The more tests conducted, the smaller the

individual probabilities of Type I error (and
the more conservative the individual tests)
must be.
In addition to the level means and sample
sizes, we need an estimate of the common
variance, $\sigma^2$.
The mean square error,
MSE = SSE/(N - I)
is an estimate of $\sigma^2$, so we use that.
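We also need a critical value from Table A.7, the
Studentized range distribution, which adjusts for
the multiple comparisons.
We use $q_{I,N-I,\alpha}$.
A $100(1-\alpha)\%$ confidence interval for the
difference between level means $\mu_i$ and $\mu_j$ is
$\bar{X}_{i.} - \bar{X}_{j.} \pm q_{I,N-I,\alpha} \sqrt{\dfrac{MSE}{2} \left( \dfrac{1}{J_i} + \dfrac{1}{J_j} \right)}$
Then $\mu_i$ and $\mu_j$ are considered significantly
different at level $\alpha$ if this interval does not
include 0.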
Example: (Artery data)
$q_{3,32,.05} \approx 3.49$
Find 95% simultaneous c.i.s for:
$\mu_2 - \mu_1$:
$\mu_3 - \mu_1$:
$\mu_3 - \mu_2$:
Which levels are significantly different?
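The three simultaneous intervals can be computed in a loop. This sketch assumes the Tukey-Kramer half-width $q \sqrt{(MSE/2)(1/J_i + 1/J_j)}$ with the approximate table value q = 3.49 and MSE = 4.33 from the ANOVA table, so the endpoints differ slightly from software that uses an exact q.

```python
import math
from itertools import combinations

means = {1: 11.209, 2: 15.086, 3: 17.330}   # level sample means
sizes = {1: 11, 2: 14, 3: 10}               # level sample sizes J_i
mse = 4.33                                  # mean square error from the ANOVA table
q_crit = 3.49                               # approximate q_{3,32,.05} from the table

intervals = {}
for i, j in combinations(means, 2):
    diff = means[j] - means[i]
    half_width = q_crit * math.sqrt(mse / 2 * (1 / sizes[i] + 1 / sizes[j]))
    intervals[(j, i)] = (diff - half_width, diff + half_width)

for (j, i), (lo, hi) in intervals.items():
    verdict = "different" if lo > 0 or hi < 0 else "not distinguishable"
    print(f"mu_{j} - mu_{i}: ({lo:.3f}, {hi:.3f}) -> {verdict}")
```

None of the three intervals contains 0, so all three pairs of levels are declared significantly different at the 5% simultaneous level, although the level 3 versus level 2 interval only narrowly excludes 0.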

Tukey 95% Simultaneous Confidence Intervals
All Pairwise Comparisons among Levels of Amount of Stenosis
Individual confidence level = 98.06%

Amount of Stenosis = level 1 subtracted from:
Amount of Stenosis   Lower  Center  Upper
level 2              1.814   3.877  5.939
level 3              3.884   6.121  8.357
[Plot: interval for each difference, axis from -3.5 to 7.0]

Amount of Stenosis = level 2 subtracted from:
Amount of Stenosis   Lower  Center  Upper
level 3              0.125   2.244  4.364
[Plot: interval for the difference, axis from -3.5 to 7.0]
Model Assumptions
Remember that ANOVA is based on a

model of independent draws from normal
populations with a common variance.
Minor deviations from normality or a

common variance will not have a strong
effect, but large deviations will require
other techniques.
Histograms, boxplots, and prior

experience can all be useful guides.
Inference for Population Proportions
We used the known sampling distribution
of $\bar{x}$ to construct confidence intervals for
its related parameter, $\mu$.
We can do the same thing with the
sampling distribution of the sample
proportion, $\hat{p}$, to construct tests and
confidence intervals for a population
proportion or probability, p.
417
Z-tests for Proportions (6.3)
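Recall from 5.1, if X ~ Binomial(n, p) and $\hat{p} = X/n$, then
$\mu_{\hat{p}} = p$ and $\sigma_{\hat{p}} = \sqrt{p(1-p)/n}$.
If $np \ge 10$ and $n(1 - p) \ge 10$, then n is large
enough that the Central Limit Theorem tells
us that $\hat{p}$ is approximately normal as well.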
We can use this sampling distribution to do

hypothesis tests on p. Since we use the
normal table, these will be z-tests.
418
1) We have a single population and a specific
value, $p_0$, we wish to consider for the
population probability of success or
population proportion of successes.
A sample from the population will give a
sample proportion different from $p_0$, even if
that is the actual population proportion.
Example: We have a possibly biased coin.
We wish to test whether or not
p = P(Heads) = 0.5 = $p_0$.

419
2) Identify $H_0$ and $H_1$.
Set up $H_1$ as the statement that something
interesting is going on, or what we wish to
prove.
Choose a one-sided or two-sided alternative,
depending on our purpose.
Two-sided: $H_0: p = p_0$ vs. $H_1: p \ne p_0$.
One-sided: $H_0: p \le p_0$ vs. $H_1: p > p_0$
or: $H_0: p \ge p_0$ vs. $H_1: p < p_0$
Example: Coin: $H_0: p = 0.5$ vs. $H_1: p \ne 0.5$.
420
3) Just as with tests on $\mu$, we find the z test
statistic
$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$
Example: 400 coin flips, 176 heads.
$\hat{p}$ = ?
z = ?
421
4) Find the P-value.
This is still a z-test. Under $H_0$, z ~ N(0,1). Use
probabilities on z* ~ N(0,1), depending on $H_1$:
$H_1$            P-value
$p \ne p_0$      $P(|z^*| \ge |z|) = 2P(z^* \le -|z|)$
$p > p_0$        $P(z^* \ge z) = P(z^* \le -z)$
$p < p_0$        $P(z^* \le z)$
Now this is the probability that, if $H_0$ is true, a
new sample would give a $\hat{p}$ which disagrees
with $H_0$ at least as much as the $\hat{p}$ we have.
Example: z = -2.40. P = ?
422
5) Choose a small $\alpha$, and reject $H_0$ for $P < \alpha$.
We have strong evidence against the null
hypothesis.
Otherwise, fail to reject $H_0$. The null
hypothesis is plausible.
Example:
$H_0: p = 0.5$ vs. $H_1: p \ne 0.5$.
z = -2.40, P = 0.0164.
If $\alpha = 0.05$, what should we conclude?
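The five steps above can be sketched in Python for the coin example. This is a minimal illustration rather than part of the original slides: the function name `one_proportion_z_test` is hypothetical, and normal probabilities come from the standard library's `statistics.NormalDist` instead of the normal table.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(x, n, p0, alternative="two-sided"):
    """Large-sample z-test of H0: p = p0 (hypothetical helper name).

    x is the number of successes in n trials. Returns (p_hat, z, p_value).
    """
    p_hat = x / n
    # Under H0 the standard error is computed from p0, not from p_hat.
    se = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    std_normal = NormalDist()  # N(0, 1)
    if alternative == "two-sided":
        p_value = 2 * std_normal.cdf(-abs(z))
    elif alternative == "greater":
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "less":
        p_value = std_normal.cdf(z)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")
    return p_hat, z, p_value


# Coin example: 400 flips, 176 heads, H0: p = 0.5 vs. H1: p != 0.5.
p_hat, z, p_value = one_proportion_z_test(176, 400, 0.5)
print(f"p_hat = {p_hat:.3f}, z = {z:.2f}, P = {p_value:.4f}")
```

The result should agree with the z = -2.40 and P = 0.0164 quoted in the example; the decision at a chosen level then follows by comparing P with that level.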
423
Example: 600 students, 217 knew the

author of The Canterbury Tales. Should
we believe that more than 1/3 of all
students know this?
424
Minitab will do the same test using exact

binomial probabilities for a slightly more
accurate P-value.
Test and CI for One Proportion

Test of p = 0.5 vs p not = 0.5
                                                    Exact
Sample    X    N  Sample p         95% CI         P-Value
1       176  400  0.440000  (0.390707, 0.490187)    0.019

Test and CI for One Proportion

Test of p = 0.5 vs p not = 0.5

Sample    X    N  Sample p         95% CI         Z-Value  P-Value
1       176  400  0.440000  (0.391355, 0.488645)    -2.40    0.016
425
Minitab may also be used to conduct

power calculations for tests of proportions.
Power and Sample Size

Test for One Proportion
Testing proportion = 0.5 (versus not = 0.5)
Alpha = 0.05

Alternative  Sample  Target
 Proportion    Size   Power  Actual Power
       0.55     783     0.8      0.800239
426
Confidence Intervals for

Population Proportions (5.3)
We have:
$P\left(-1.96 < \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} < 1.96\right) \approx 0.95$
Isolating p in the probability statement is
trickier than in the case for $\mu$, because it
appears in both the numerator and the
denominator.
427
In the past, people usually replaced the
unknown p's in the standard error with the
known $\hat{p}$, so the 95% confidence interval
for p would be
$\hat{p} \pm 1.96\sqrt{\hat{p}(1 - \hat{p})/n}$
This traditional interval has the same
format as the $\mu$-interval: estimate, plus or minus a
critical value times a standard error.
It works well for large n, but is no longer
recommended.
428
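Unfortunately, the probability of containing
p in the interval can be well below 95% for
smaller n.
It turns out this can be corrected by adding
2 successes and 2 failures to our counts:
$\tilde{n} = n + 4$, $\tilde{p} = (X + 2)/(n + 4)$
Then a $100(1-\alpha)\%$ confidence interval for
p will be
$\tilde{p} \pm z_{\alpha/2}\sqrt{\tilde{p}(1 - \tilde{p})/\tilde{n}}$
for all n.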
429
Example: In a blind taste test, subjects

choose between instant and fresh-brewed
coffee. Out of 40 subjects, 12 prefer the
instant coffee. If p is the probability that a
random person prefers instant coffee in a
blind test, find a 95% confidence interval
for p.
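As a rough sketch of the "add 2 successes and 2 failures" interval applied to the coffee example, the Python below may help. The function name `plus_four_ci` is hypothetical, and the z critical value is taken from `statistics.NormalDist` rather than from the normal table.

```python
from math import sqrt
from statistics import NormalDist


def plus_four_ci(x, n, conf=0.95):
    """Interval for p after adding 2 successes and 2 failures to the counts."""
    n_tilde = n + 4
    p_tilde = (x + 2) / n_tilde
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # about 1.96 for 95%
    half_width = z * sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return p_tilde - half_width, p_tilde + half_width


# Coffee taste test: 12 of 40 subjects prefer the instant coffee.
lo, hi = plus_four_ci(12, 40)
print(f"95% CI for p: ({lo:.3f}, {hi:.3f})")
```

The same function can be reused for the coin examples by changing the counts, which shows how the interval narrows as n grows.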
430
Example: Suppose we flip a (possibly

biased) coin. Let p be the probability of a
head. If 100 tosses result in 45 heads,
find a 95% confidence interval for p. Is it
plausible that our coin could be fair?
If 1000 tosses give 450 heads, now what

is our confidence interval for p?
431
If we have an idea of the value of p, and we

require a 95% confidence interval of error bound
(interval half-width) no more than w, we can
compute a minimum sample size.
Of course, we can substitute the appropriate z

critical value to find sample sizes for other
confidence levels.
If p is unknown, we should replace it with

the conservative value 0.5, and require a
minimum sample size of
The associated error bound
is the margin of error that surveys and
polls generally report.
432
433
Example: If we take a survey and want a

2% margin of error (w = 0.02), how big a
sample must we take?
What if we're willing to settle for a 3%
margin of error?
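The sample-size formulas on the preceding slides did not survive extraction, so the sketch below assumes the usual normal-approximation error bound, w = z * sqrt(p(1-p)/n), solved for n. The function name `min_sample_size` is hypothetical. If the slides used a variant (for example, one adjusted for the added successes and failures), the answers would differ by a few observations.

```python
from math import ceil
from statistics import NormalDist


def min_sample_size(w, p=0.5, conf=0.95):
    """Smallest n with z * sqrt(p(1-p)/n) <= w (assumed error-bound formula).

    p = 0.5 is the conservative choice when p is unknown.
    """
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil(z**2 * p * (1 - p) / w**2)


for w in (0.02, 0.03):
    print(f"margin of error {w:.0%}: n >= {min_sample_size(w)}")
```

Loosening the margin from 2% to 3% cuts the required sample by more than half, because n scales with 1/w^2.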
434
Two-Proportion Inference (7.2)
Just as we can use hypothesis tests and

confidence intervals to compare means of
two related populations, we may also use
them to compare two related binomial
probabilities or population proportions.
Suppose $X \sim B(n_X, p_X)$, and $Y \sim B(n_Y, p_Y)$.
We estimate the probabilities with
$\hat{p}_X = X/n_X$ and $\hat{p}_Y = Y/n_Y$.
435
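If we want a confidence interval on the
difference between two proportions, we
can use the fact that for independent X
and Y with large $n_X$ and $n_Y$,
$\hat{p}_X - \hat{p}_Y \sim N\left(p_X - p_Y,\ \frac{p_X(1-p_X)}{n_X} + \frac{p_Y(1-p_Y)}{n_Y}\right)$, approximately.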
436
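Recall that for a modern confidence
interval on a single proportion, we used an
alternative estimate of p, adding two
successes and two failures to our counts:
$\tilde{n} = n + 4$, $\tilde{p} = (X + 2)/(n + 4)$
With two samples, we distribute the extras
between the two estimates:
$\tilde{n}_X = n_X + 2$, $\tilde{p}_X = (X + 1)/(n_X + 2)$
$\tilde{n}_Y = n_Y + 2$, $\tilde{p}_Y = (Y + 1)/(n_Y + 2)$
437
Our $100(1-\alpha)\%$ confidence interval is
$\tilde{p}_X - \tilde{p}_Y \pm z_{\alpha/2}\sqrt{\frac{\tilde{p}_X(1-\tilde{p}_X)}{\tilde{n}_X} + \frac{\tilde{p}_Y(1-\tilde{p}_Y)}{\tilde{n}_Y}}$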
Example: Are rural households more likely

to use a natural Christmas tree than urban
ones?
Rural: nX = 160, X = 64
Urban: nY = 261, Y = 89
Find a 95% confidence interval for $p_X - p_Y$.
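One way to carry out this calculation is sketched below. It assumes the usual reading of "distribute the extras between the two estimates", namely one extra success and one extra failure in each sample; the function name `two_proportion_ci` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ci(x, n_x, y, n_y, conf=0.95):
    """CI for pX - pY, adding one success and one failure to each sample."""
    n_x_t, n_y_t = n_x + 2, n_y + 2
    p_x_t = (x + 1) / n_x_t
    p_y_t = (y + 1) / n_y_t
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    se = sqrt(p_x_t * (1 - p_x_t) / n_x_t + p_y_t * (1 - p_y_t) / n_y_t)
    diff = p_x_t - p_y_t
    return diff - z * se, diff + z * se


# Christmas tree example: rural 64 of 160, urban 89 of 261.
lo, hi = two_proportion_ci(64, 160, 89, 261)
print(f"95% CI for pX - pY: ({lo:.3f}, {hi:.3f})")
```

Whether the interval contains 0 indicates whether a difference between rural and urban households is plausible at this confidence level.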
438
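To test $H_0: p_X = p_Y = p$ vs. $H_1: p_X \ne p_Y$,
we must estimate the common null
proportion p with the pooled proportion:
$\hat{p} = \frac{X + Y}{n_X + n_Y}$
Then our z statistic may be found as
$z = \frac{\hat{p}_X - \hat{p}_Y}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_X} + \frac{1}{n_Y}\right)}}$
Compute P-values as for any other z test.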

439
Example: Natural Christmas trees
Rural: nX = 160, X = 64
Urban: nY = 261, Y = 89
Can we conclude that rural households are
more likely to use a natural Christmas tree
than urban ones?
440
Example: Are male college students more

likely to be frequent binge drinkers than
female students?
Of 9916 college women surveyed, 1684 were

classified as frequent binge drinkers.
Of 7180 college men surveyed, 1630 were
considered frequent binge drinkers.
Is there a significant difference in the
populations?
441
Minitab:
Test and CI for Two Proportions

Sample     X     N  Sample p
1       1684  9916  0.169827
2       1630  7180  0.227019

Difference = p (1) - p (2)
Estimate for difference:  -0.0571930
95% upper bound for difference:  -0.0469659
Test for difference = 0 (vs < 0):  Z = -9.34  P-Value = 0.000
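The Minitab Z value can be checked by hand with the pooled two-proportion z statistic. The sketch below is illustrative only; the function name `two_proportion_z` is hypothetical, and the lower-tail P-value matches the "vs < 0" alternative shown in the output.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x, n_x, y, n_y):
    """Pooled z statistic for H0: pX = pY."""
    p_x, p_y = x / n_x, y / n_y
    p_pool = (x + y) / (n_x + n_y)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_x + 1 / n_y))
    return (p_x - p_y) / se


# Binge drinking: women 1684 of 9916 (sample 1), men 1630 of 7180 (sample 2).
z = two_proportion_z(1684, 9916, 1630, 7180)
p_value = NormalDist().cdf(z)  # lower tail, for H1: p1 < p2
print(f"z = {z:.2f}, P = {p_value:.3f}")
```

The same function applies to the Christmas tree data by passing the rural and urban counts and using the upper tail instead.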
442
Chi-Squared Tests (6.5)
A number of common tests use members

of the chi-squared family of distributions as
null distributions.
We will study chi-squared tests for

hypotheses involving categorical variables
with more than two categories.
This includes tests of proportions

comparing more than two populations.
443
The Chi-Squared Distribution
The chi-squared distribution is skewed,
with support on the positive real line.
It has one degrees of freedom parameter,
$\nu$.
If $Y \sim \chi^2_\nu$, then $E(Y) = \nu$ and $V(Y) = 2\nu$.
444
445
Table A.5 contains important critical values, $\chi^2_{\nu,\alpha}$.
If $X \sim \chi^2_\nu$, then $P(X \ge \chi^2_{\nu,\alpha}) = \alpha$.
446
Example: $X \sim \chi^2_3$
$P(X \ge 7.815) = 0.05$
$P(X \ge 11.345) = 0.01$
Example: $X \sim \chi^2_5$
$P(X \ge 9.236)$ = ?
Find c such that $P(X \ge c) = 0.025$.
447
Chi-Square Tests on One

Categorical Variable
In many situations, outcomes are broken

down into multiple categories.
If there are only two categories of interest

(success / failure), then we study the
probability of a success, p, and test using
the z-test.
448
If there are more than two categories of

interest, we analyze them in a different
way.
Just as with binomial data, we generally
don't record each observation individually.
Instead, we record the number of times
each category occurs, in a contingency
table.
449
Example: A die is rolled 90 times, and we
record the numbers rolled. The
contingency table might look like:
Roll     1   2   3   4   5   6   Total
Count   14  20  17  10   8  21   n = 90
450
Example: Factory floor accidents are
recorded by day of the week.
Day     M   T   W   Th  F   Total
Count   65  43  48  41  73  n = 270
Example: Snapdragon color genotype may
be any of three colors.
Color   Red  Pink  White  Total
Count   57   89    54     n = 200
451
Suppose we have N independent trials.
Each trial may result in category 1 with
probability p1, category 2 with probability
p2, and so on up to category k with
probability pk. (Note: p1 + ... + pk = 1.)
Let Oi be the observed count of trials in
category i, i = 1, ..., k.
452
Suppose we wished to test whether the die

was fair, or whether accidents were equally
likely each day of the week, or whether the
snapdragons satisfy standard genetic theory.
Such a hypothesis can be written:
H0: p1 = p10, ..., pk = pk0.
Example (Die): H0: pi = 1/6, i = 1, ..., 6.
Example (Factory): H0: pi = 1/5, i = 1, ..., 5.
Example (Snapdragons): H0: p1 = .25, p2 = .5,
p3 = .25
453
The many-sided alternative, H1, here is

simply that at least one of the probabilities
is incorrect. This is often left implied.
We test these hypotheses by comparing
the observed cell frequencies O1, ..., Ok,
with expected cell frequencies E1, ..., Ek.
If the probability of category i is pi0, then
the expected count in this category from N
trials is Ei = N pi0.
454
We need an expected frequency for each cell

in our table.
Example (Die): N = 90, pi0 = 1/6, so
Ei = 90/6 = 15, i = 1, ..., 6.
Example (Factory): N = 270, pi0 = 1/5, so
Ei = 270/5 = 54, i = 1, ..., 5.
Note that the Ei's might not be integers, and
will only be equal if the pi0's are.
Example (Snapdragons): N = 200, so
E1 = E3 = 200*.25 = 50, E2 = 200*.5 = 100.
455
Once we have an observed and an

expected frequency for each cell in our
table, we compute a test statistic.
This statistic will compare the two sets of

frequencies, with a larger value indicating
less similar sets.
We will compute P-values using the chi-square
table, so these are referred to as
chi-square statistics (and tests).
456
The chi-square statistic:
$X^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$
$X^2$ gets larger as Oi differs from Ei for any
cell, in either direction.
Cells where Ei is larger require a larger
difference to contribute as much to $X^2$.
457
We compute P-values by comparing $X^2$ to
a chi-square distribution with (k-1) degrees
of freedom (one less than the number of
cells).
P-value = $P(\chi^2_{k-1} \ge X^2)$
We can reject $H_0$ at the level $\alpha$ if our
statistic is larger than the critical point $\chi^2_{k-1,\alpha}$.
458
Example (Die):
Roll    1   2   3   4   5   6   Total
Oi     14  20  17  10   8  21   N = 90
Ei     15  15  15  15  15  15   N = 90
d.f. = 5,
$0.05 \le P \le 0.10$
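The die example can be reproduced with a short Python sketch. The helper names `chi2_sf` and `goodness_of_fit` are hypothetical; `chi2_sf` uses the standard closed-form series for the chi-square upper tail with integer degrees of freedom, standing in for the table lookup.

```python
from math import erfc, exp, pi, sqrt


def chi2_sf(x, df):
    """Upper-tail probability P(chi-square_df >= x) for integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over k < df/2 of (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2) / k
            total += term
        return exp(-x / 2) * total
    # Odd df: two-sided normal tail plus a finite correction series.
    term = sqrt(2 * x / pi) * exp(-x / 2)
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return erfc(sqrt(x / 2)) + total


def goodness_of_fit(observed, probs):
    """Return (X^2, df, P-value) for H0: category probabilities equal probs."""
    n = sum(observed)
    expected = [n * p for p in probs]
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return x2, df, chi2_sf(x2, df)


# Die example: 90 rolls, H0: each face has probability 1/6.
x2, df, p_value = goodness_of_fit([14, 20, 17, 10, 8, 21], [1 / 6] * 6)
print(f"X^2 = {x2:.3f}, d.f. = {df}, P = {p_value:.4f}")
```

The computed P-value should land between 0.05 and 0.10, in line with the table-based bracket on the slide. Passing the snapdragon counts with probabilities 0.25, 0.5, 0.25 handles the unequal-probability case.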
459
Example (Snapdragons):
Color   Red  Pink  White  Total
Oi      57   89    54     N = 200
Ei
H0: p1 = 0.25, p2 = 0.5, p3 = 0.25
$X^2$ = ?
P?
460
Two-Way Tables and Testing

Independence
Sometimes we collect data on more than

one categorical variable at once. We can
present the results for two such variables
in a two-way table, with one variable in
rows, the other in columns.
Recall, such data can be plotted in

clustered bar charts.
461
Example: Handedness by Sex
               Men   Women  Total
Right-handed   934   1070   2004
Left-handed    113     92    205
Ambidextrous    20      8     28
Total         1067   1170   2237
462
463
We generalize the notation as:

           1     2    ...   j    ...   J    Row Totals
1         O11   O12   ...  O1j   ...  O1J   O1.
2         O21   O22   ...  O2j   ...  O2J   O2.
...
i         Oi1   Oi2   ...  Oij   ...  OiJ   Oi.
...
I         OI1   OI2   ...  OIj   ...  OIJ   OI.
Column
Totals    O.1   O.2   ...  O.j   ...  O.J   O.. = N
464
The null hypothesis we usually wish to test

from such two-way data is that the two
variables are independent. That is, the
probability of seeing one level in variable 1
does not depend on the level in variable 2
and vice-versa.
We test this hypothesis against an

alternative of dependence using another
chi-square test.
Note: We have no specific probabilities in

mind here.
465
Alternatively, if one of our variables

represents several related populations, our
null hypothesis is that these populations
are homogeneous with respect to the
remaining variable.
That is, the probability for each category is

the same for each population.
This will be tested exactly the same way

as independence.
466
The observed cell frequencies are the counts

Oij.
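The expected cell frequencies for a test of
independence are found as
$E_{ij} = \frac{O_{i\cdot}\, O_{\cdot j}}{N}$
(row total times column total, divided by the grand total).
We still compute $X^2$ as before, summing over
all cells.
We use a chi-square with (I - 1)(J - 1)
degrees of freedom to find our P-value.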
467
Example: Handedness by Sex
               Men   Women  Total
Right-handed   934   1070   2004
Left-handed    113     92    205
Ambidextrous    20      8     28
Total         1067   1170   2237
$X^2$?
d.f.?
P?
468
Minitab:
Chi-Square Test: Men, Women

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

           Men    Women  Total
    1      934     1070   2004
        955.86  1048.14
         0.500    0.456

    2      113       92    205
         97.78   107.22
         2.369    2.160

    3       20        8     28
         13.36    14.64
         3.306    3.015

Total     1067     1170   2237

Chi-Sq = 11.806, DF = 2, P-Value = 0.003
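The Minitab result can be reproduced from the counts alone. In this sketch the function name `chi2_independence` is hypothetical, and the count of ambidextrous women (8) is inferred from the row and column totals, since it is missing from the extracted table.

```python
from math import exp


def chi2_independence(table):
    """Return (X^2, df, expected) for a two-way table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count = row total * column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    x2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return x2, df, expected


# Handedness by sex; columns are (Men, Women).
handedness = [
    [934, 1070],  # right-handed
    [113, 92],    # left-handed
    [20, 8],      # ambidextrous (8 inferred from the totals)
]
x2, df, expected = chi2_independence(handedness)
# For df = 2 the chi-square upper tail has the closed form exp(-x/2).
p_value = exp(-x2 / 2)
print(f"X^2 = {x2:.3f}, d.f. = {df}, P = {p_value:.3f}")
```

The closed-form tail probability is specific to 2 degrees of freedom; other table shapes need a chi-square table or a statistics package.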

469
Ch. 8: Inference in Regression

Recall our analyses of Chapter 2, where
we looked at bivariate data.
Example:
January and February inflows of the Nile river

at a location.
Regression involves modeling and

predicting the values of one response
variable, based on the observed values of
one or more other explanatory variables.
470
Inference in Simple Linear Regression

The simple linear regression model fits a
straight line to a set of paired data
observations.
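Formally:
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
$\beta_0$ and $\beta_1$ are (unknown) constants
$\varepsilon_1, \ldots, \varepsilon_n$ are assumed to be independent draws
from a $N(0, \sigma^2)$ distribution.
$y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$
$E(y_i) = \beta_0 + \beta_1 x_i$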
471
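The most common way to estimate $\beta_0$ and
$\beta_1$ uses the least squares fit, minimizing
$\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$
This leads to the least squares estimates,
$\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$, $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$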
472
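Recall: Given a data set $(x_i, y_i)$ and an
associated fitted regression model, the fitted
value for observation i is
$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
The residual for i is
$e_i = y_i - \hat{y}_i$
The error sum of squares (SSE) is found as
$SSE = \sum e_i^2 = \sum (y_i - \hat{y}_i)^2$
An estimate of $\sigma$, the standard deviation
around the regression line is
$s = \sqrt{SSE/(n-2)}$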
473
Also recall the total sum of squares, SST:
$SST = \sum (y_i - \bar{y})^2$
and the regression sum of squares, SSR:
$SSR = \sum (\hat{y}_i - \bar{y})^2$
which give us the computing formula
SSE = SST - SSR.
The coefficient of determination, $r^2$, measures
the proportion of the total variation of y which
is explained by x:
$r^2 = SSR/SST = 1 - SSE/SST$
474
Inference in simple linear regression
usually focuses on $\hat{\beta}_1$, the estimate of the
slope parameter $\beta_1$, which measures how
much y changes for a one-unit change in x.
Just like other sample statistics, $\hat{\beta}_1$ has a
sampling distribution.
Under our model, this distribution is known
and may be used to construct confidence
intervals and hypothesis tests.
475
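Under the standard regression model (slide
471),
$t = \frac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}} \sim t_{n-2}$, where $s_{\hat{\beta}_1} = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}}$
We may test $H_0: \beta_1 = \beta_{10}$ using test statistic t
with $\beta_{10}$ in place of $\beta_1$ and using a t table for
n-2 degrees of freedom to find a P-value.
Most commonly, $\beta_{10} = 0$.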
476
Example: Nile data (testing $H_0: \beta_1 = 0$):
n = 115
$\hat{\beta}_1$ = .836
s = .331
$\sum (x_i - \bar{x})^2$ = 119.25
$s_{\hat{\beta}_1}$ = ?
t = ?
d.f. = ?
P?
477

February = - 0.470 + 0.836 January

Predictor      Coef  SE Coef      T      P
Constant    -0.4698   0.1257  -3.74  0.000
January     0.83617  0.03027  27.63  0.000
478
Can we conclude that the population slope
$\beta_1$ is greater than 0.8?
We can use the distribution on t to
construct a confidence interval for $\beta_1$ as
$\hat{\beta}_1 \pm t_{n-2,\alpha/2}\, s_{\hat{\beta}_1}$
Example: Nile data (95% c.i. for $\beta_1$):
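The Nile calculations can be sketched from the summary numbers alone. Two assumptions are made here and flagged in the comments: the value 119.25 is taken to be the sum of squared deviations of x, and the t critical value for 113 degrees of freedom is approximated as 1.981 (between the 100 and 120 d.f. rows of a typical t table). Because s is rounded, the results differ slightly from the Minitab output.

```python
from math import sqrt

n = 115
beta1_hat = 0.836   # estimated slope
s = 0.331           # estimated standard deviation around the line
sxx = 119.25        # ASSUMED to be the sum of (x_i - x_bar)^2

se_beta1 = s / sqrt(sxx)
df = n - 2

# Test statistics for two different null values of the slope.
t_zero = (beta1_hat - 0.0) / se_beta1     # H0: beta1 = 0
t_point8 = (beta1_hat - 0.8) / se_beta1   # H0: beta1 <= 0.8 vs. H1: beta1 > 0.8

# ASSUMED approximate table value for t with 113 d.f. and upper area 0.025.
t_crit = 1.981
ci = (beta1_hat - t_crit * se_beta1, beta1_hat + t_crit * se_beta1)

print(f"SE = {se_beta1:.4f}, d.f. = {df}")
print(f"t (beta1 = 0) = {t_zero:.2f}, t (beta1 = 0.8) = {t_point8:.2f}")
print(f"95% CI for beta1: ({ci[0]:.3f}, {ci[1]:.3f})")
```

Comparing the second t value with the t table for 113 degrees of freedom answers the "greater than 0.8" question; checking whether 0.8 lies inside the interval gives a consistent picture.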
479
480
Inference in Correlation
Recall the definition of the sample correlation

coefficient.
Then
.
481
The population correlation is denoted $\rho$.
Suppose we wish to test the hypotheses
$H_0: \rho = 0$ vs. $H_1: \rho \ne 0$.
We can test this with a t test.
Our test statistic is
$U = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}}$
Under $H_0$, $U \sim t_{n-2}$, so we use the t table to
estimate our P-value.
482
Example: State data: n = 50
High school graduation vs. murder:
r = -.488
U=?
d.f. = ?
P?
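A quick numerical check of the test statistic for the state data is sketched below, using the usual form U = r * sqrt(n - 2) / sqrt(1 - r^2). The P-value itself still comes from a t table with n - 2 degrees of freedom.

```python
from math import sqrt

n = 50
r = -0.488  # sample correlation, high school graduation vs. murder

u = r * sqrt(n - 2) / sqrt(1 - r**2)
df = n - 2
print(f"U = {u:.2f}, d.f. = {df}")
```

A value of |U| this far beyond typical t critical values is consistent with the 0.000 P-value reported in the Minitab correlation output.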
483
Correlations: Illiteracy, HS Grad, Murder

          Illiteracy  HS Grad
HS Grad       -0.657
               0.000

Murder         0.703   -0.488
               0.000    0.000

Cell Contents: Pearson correlation
               P-Value
484
Residual Plots (8.2)
Examination of residuals is an important

check on a regression analysis.
Generally, we look at a residual plot of $x_i$
(or $\hat{y}_i$) versus $e_i$.
If the residual plot is flat, even, and

appears random around e = 0, everything
is probably fine.
485
486
Residual outliers indicate points which are

not well fit by the model. Check for
explanations, and possibly remove those
points.
487
Nonlinear patterns in the residual plot

suggest a linear fit is inappropriate.
488
Funnel-shaped plots mean your data is
heteroscedastic (different scatter), meaning the
standard deviation of y is not constant; it
depends on x. Fitted values may still be
reasonable, but $r^2$ and s may not mean much.
489
Power Transformations
Skewness (including outliers), curvature,

and heteroscedasticity can often all be
improved by the use of nonlinear
transformations on y, on x, or on both.
Such transformations include such options

as logs, square roots, and reciprocals (1/x).
Apply the transformation to all observations

on this variable.
490
Example: Tortilla frying time in seconds (x) vs.

moisture content in % (y), shows a curved,
decreasing relationship.
491
Taking logs of x and y leads to a quite

linear plot.
492
Finding a good combination of transformations on

x and y may require some experimentation and
patience.
Once we have a linear scatterplot, we may fit the

transformed variables. Inverse transformations
may give us models for x and y.
Example: Tortilla data. Fitting ln(y) to ln(x) gives

us
ln(y) = 4.64 - 1.05 ln(x)
and taking antilogs on each side gives us
$y = 103.4\, x^{-1.05}$.
493
Inference in Multiple Regression (8.3)
Often we are interested in the relationship

between a response variable and multiple
explanatory variables.
Visually, we can inspect these

relationships with a matrix plot (or
scatterplot matrix).
Each pair of variables appears as two

scatterplots, one in each orientation.
Example: In late 1980, the city of

Concord, New Hampshire, began a
campaign to encourage water
conservation.
We wish to examine household water

usage (in cubic feet) in 1981 based on
1980 usage and a variety of other
household variables.
494
495
496
We can predict y from multiple x's using
multiple regression.
Our model looks like the one for simple
linear regression, with additional x terms.
$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \varepsilon_i$.
Coefficient $\beta_j$ measures the amount we
expect y to change when increasing $x_j$ by
one unit, while holding all of the other x's
constant.
497
We fit our coefficients again using the

method of least squares.
The computations are much more

complex than in the simple linear
regression case.
They are most easily represented and
computed using matrix methods: difficult
by hand, but easy for a computer.
498
Regression Analysis: WATER81 versus WATER80, INCOME, ...

WATER81 = 412 + 0.489 WATER80 + 0.0193 INCOME - 43.7 EDUCATION
          + 235 PEOPLE81 + 96.6 CHPEOPLE

Predictor       Coef   SE Coef      T      P
Constant       412.0     189.0   2.18  0.030
WATER80      0.48885   0.02638  18.53  0.000
INCOME      0.019271  0.003368   5.72  0.000
EDUCATION     -43.65     13.23  -3.30  0.001
PEOPLE81      234.71     28.00   8.38  0.000
CHPEOPLE       96.56     80.76   1.20  0.232
499
S = 851.914   R-Sq = 67.5%   R-Sq(adj) = 67.1%

Source           DF          SS         MS       F      P
Regression        5   737617962  147523592  203.27  0.000
Residual Error  490   355620748     725757
Total           495  1093238710
500
501
Example: If two households are the same

in all respects except that one includes an
additional person, how much additional
water should we predict the larger
household will use?
502
Example: Suppose a household:
used 5000 cubic feet of water in 1980,

contained 4 people in both 1980 and 1981,
had a household income of $25,000, and
had a head of household with 12 years of
education.
Predict the water usage for this household

in 1981.
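A prediction like this is just the fitted equation evaluated at the household's values. The sketch below uses the coefficients from the Minitab output and makes two assumptions that the slides do not state explicitly: INCOME is entered in dollars, and CHPEOPLE is the change in household size from 1980 to 1981 (0 for this household).

```python
# Coefficients copied from the Minitab regression output for WATER81.
coef = {
    "Constant": 412.0,
    "WATER80": 0.48885,
    "INCOME": 0.019271,
    "EDUCATION": -43.65,
    "PEOPLE81": 234.71,
    "CHPEOPLE": 96.56,
}

# ASSUMPTIONS: INCOME is in dollars; CHPEOPLE is the change in the number
# of people, which is 0 because the household had 4 people in both years.
household = {
    "WATER80": 5000,
    "INCOME": 25000,
    "EDUCATION": 12,
    "PEOPLE81": 4,
    "CHPEOPLE": 0,
}

predicted = coef["Constant"] + sum(coef[k] * v for k, v in household.items())
print(f"Predicted 1981 usage: {predicted:.0f} cubic feet")
```

Changing PEOPLE81 by one while holding everything else fixed shifts the prediction by the PEOPLE81 coefficient, which is the "one additional person" comparison asked about earlier (setting aside CHPEOPLE).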
503
As in simple regression, we measure the
effectiveness of our model by
$SSE = \sum (y_i - \hat{y}_i)^2$
and
$SST = \sum (y_i - \bar{y})^2$
and combine them with the coefficient of
multiple determination
$R^2 = 1 - SSE/SST = SSR/SST$
504
As in the simple linear regression case, $R^2$
may be interpreted as the proportion of
variance in y explained by our model and
all of the x's.
It is not a simple correlation squared any
more.
Example: What percentage of the

variability in 1981 water usage is
explained by our model?
505
$R^2$ is guaranteed to increase if you add
another x to your model, even if it is
related to y only by chance.
Because of this, many analysts prefer the
adjusted $R^2$ for multiple regression,
especially when comparing models with
different numbers of explanatory variables.
506
We can still use t-tests and confidence

intervals on the individual coefficients in a
multiple regression model.
The standard errors are complicated to

compute, but available in output from
Minitab and other packages.
The degrees of freedom for the t-table will
be [n - (p + 1)].
507
We can also conduct a test of model utility

on the entire model at once.
The null hypothesis is that all slope
coefficients are 0 (so none of the x's are
useful in predicting y).
The (F) test statistic is
$F = \frac{SSR/p}{SSE/[n - (p + 1)]} = \frac{MSR}{MSE}$
Under $H_0$, F has an F distribution with p
and [n - (p + 1)] degrees of freedom.
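The F statistic and both R-squared values in the water-usage output can be recomputed from the ANOVA sums of squares. This is a small check rather than anything from the slides; the adjusted R-squared line uses the standard definition, 1 - [SSE/(n - p - 1)] / [SST/(n - 1)], which the slides name but do not write out.

```python
# Sums of squares and degrees of freedom from the water-usage ANOVA table.
ssr, df_reg = 737617962, 5     # regression: p = 5 explanatory variables
sse, df_err = 355620748, 490   # residual error: n - (p + 1)
sst = ssr + sse
df_total = df_reg + df_err     # n - 1

msr = ssr / df_reg
mse = sse / df_err
f_stat = msr / mse

r2 = ssr / sst
# Standard definition of adjusted R-squared (an assumption about what
# Minitab reports as R-Sq(adj)).
r2_adj = 1 - (sse / df_err) / (sst / df_total)

print(f"F = {f_stat:.2f}, R-Sq = {r2:.1%}, R-Sq(adj) = {r2_adj:.1%}")
```

The values should line up with F = 203.27, R-Sq = 67.5%, and R-Sq(adj) = 67.1% in the output.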
508
The F test and P-value may be found in

the ANOVA table in regression output from
Minitab and other statistical packages.
The test of model utility is also found in the
simple regression case, but is completely
equivalent to the t-test on $\beta_1$ for this case.
509
Example: What does the test of model

utility say about our multiple regression of
1981 water usage?
What do the t-tests say about the

individual terms in the model?
510
Interactions and Polynomials
An important special case of multiple

regression involves the use of interaction
(product) terms.
$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_{12} x_{1i} x_{2i} + \varepsilon_i$.
The effect of increasing x1 on y depends

on the value of x2.
Interpretation becomes more difficult, but

relationships suggested by interaction
terms may be important and interesting.
511
Regression Analysis: WATER81 versus WATER80, INCOME, ...

WATER81 = -769 + 0.974 WATER80 + 0.0213 INCOME + 39.5 EDUCATION
          + 217 PEOPLE81 - 0.0336 WATER80*EDUCATION

Predictor               Coef   SE Coef      T      P
Constant              -768.9     313.4  -2.45  0.014
WATER80               0.9742    0.1090   8.93  0.000
INCOME              0.021263  0.003310   6.42  0.000
EDUCATION              39.55     22.25   1.78  0.076
PEOPLE81              216.57     27.52   7.87  0.000
WATER80*EDUCATION  -0.033617  0.007275  -4.62  0.000

S = 835.152   R-Sq = 68.7%   R-Sq(adj) = 68.4%
512
What is the effective coefficient (simple

slope) on Water80 for a household whose
head has 8 years of education?
12 years?
16 years?
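With an interaction term, the effective coefficient on WATER80 is the WATER80 coefficient plus the interaction coefficient times EDUCATION. A minimal sketch, using the coefficients from the Minitab output (the function name `water80_slope` is hypothetical):

```python
b_water80 = 0.9742          # coefficient on WATER80
b_interaction = -0.033617   # coefficient on WATER80*EDUCATION


def water80_slope(education_years):
    """Effective WATER80 coefficient at a given EDUCATION value."""
    return b_water80 + b_interaction * education_years


for years in (8, 12, 16):
    print(f"EDUCATION = {years}: slope on WATER80 = {water80_slope(years):.3f}")
```

The slope shrinks as education increases, which is what the negative interaction coefficient expresses.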
513
Another important special case is polynomial

regression.
$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$.
This allows fitting polynomial curves to nonlinear

scatterplots.
Example: Yield
(kg/ha) vs. time
to harvest (days
after flowering)
for paddy, a
grain from India.
514
Polynomial Regression Analysis: Yield versus Time

Yield = - 1070 + 293.5 Time - 4.536 Time**2

S = 203.883   R-Sq = 79.4%   R-Sq(adj) = 76.2%

Source      DF       SS       MS      F      P
Regression   2  2084779  1042390  25.08  0.000
Error       13   540388    41568
Total       15  2625168
515
Note that regression with higher powers or
interaction terms is still considered linear
regression, not because it is linear in the
x's (it's not), but because it is linear in the
parameters being estimated, the $\beta$'s.
516
Intelligent Consumption of Statistics
Even most statisticians deal with pre-generated
statistics in journals and the
news far more frequently than they are
called upon to compute them themselves.
The ability to read and understand such

statistics with a properly critical eye and
brain (statistical literacy) is one which
should be expected of any educated adult.
517
Consider the following newspaper items:
In 2008, the export value of opium from

Afghanistan was $3.4 billion.
Three out of every 1,000 patients who have
their stomachs stapled will die within three
months.
Each year, about 1,100 suicides occur on
U.S. college campuses.
How firmly should we believe these?
518
Remember, any observed value can be
broken down into three parts:
the true value (what we'd like to know)
randomness (unavoidable)
nonstatistical mistakes (what to watch for)
Too often, we assume the observed value
is the true value, and forget the other
components.
519
We deal with randomness by
insisting on large samples wherever possible,

reporting standard errors,
using confidence intervals instead of point
estimates,
testing for statistical significance,
and so on.
Nonstatistical mistakes are trickier.

520
Dangers in Data Collection

Remember, our statistical methods of data
analysis all assume we've collected our
data through planned introduction of
chance.
This means:
Randomized, controlled experiments or

Random samples.
521
Many experiments use no control group,

or a nonrandomized one (such as
historical data)
Example: A study on coronary bypass surgery

showed its subjects survived longer than
historical controls.
The sickest subjects couldn't be given the
surgery (they likely wouldn't survive it), so
were excluded from the study.
A randomized controlled study showed only
minor survival differences between the
groups.
522
Many studies use samples of

convenience.
Psychologists often require their students to

participate in experiments.
Critics have suggested modern psychology be
renamed "psychology of the college
sophomore."
523
Even a random sample may be poor if

drawn from a different population than
implied.
A university does not provide daycare for

children of students. A sample of students of
the university asking if they require daycare to
attend classes is virtually certain to have a
small percentage (at most) saying yes. This
is not useful for determining if providing
daycare would allow other parents to attend
classes.
524
If a report on a study gives no information

about how the sample was collected, take
the results with a few barrels of salt,
especially if they seem unreasonable
otherwise.
525
Dangers of Survey Research
Survey results are among the most

commonly reported statistical results, yet
they have some of the greatest dangers
associated with them.
Knowing the details of a survey is critical

to evaluating the results it reports.
526
Defining the population is critical, but

finding it once it is defined is often difficult
or impossible.
"Teenage mothers in Grand Forks County" is
a well-defined population, but it could be hard
to find a list.
527
Selection and nonresponse biases will

often produce skewed results.
In 1936, Franklin Delano Roosevelt (a

Democrat) was running for his second
term as president against Republican Alf
Landon.
Literary Digest magazine sent out

questionnaires to 10 million people from
phone books, club membership lists, and
magazine subscription lists.
528
From the 2.4 million responses they

received, they predicted Landon would win
57% to 43%. On election day, Roosevelt
won 62% to 38%.
What went wrong?
529
In 1987, Shere Hite published The Hite

Report: Women and Love. This was a
study based on a long, essay-type survey
of women on love and sex.
Out of 100,000 questionnaires distributed

to organizations like church groups,
political organizations, and counseling
centers, Hite received 4,500 back.
530
From this, she generated claims (and
headlines!) like "70% of women married
five years or more are having sex outside
of their marriages."
Hite claims that because her sample was
large, and well-matched to census data in
factors such as race, income, and
geographic region, her results can be
taken as representative of the country as a
whole.
Is this a valid claim?

531
The wording of questions is critical; ignore

a survey that will not provide the exact
questions asked.
13% of Americans think we spend too much

on assistance to the poor.
44% think we spend too much on welfare.
532
August 15-17, 2009, NBC/Wall St. Journal poll:
Would you favor or oppose creating a public health
care plan administered by the federal government that
would compete directly with private health insurance
companies?
43% favor, 47% oppose
August 19, 2009, Survey USA poll:
In any health care proposal, how important do you
feel it is to give people a choice of both a public plan
administered by the federal government and a private
plan for their health insurance: extremely important,
quite important, not that important, or not at all
important?
77% extremely (58%) + quite (19%), 22% not that (7%) + not
at all (15%)
533
From Republican congressman John Culberson's web page:
534
Sept. 19-22, 2008:
Pew Research: As you may know, the government is

potentially investing billions to try to keep financial institutions
and markets secure. Do you think this is the right thing or the
wrong thing for the government to be doing?
57% Right thing, 30% Wrong thing
L.A. Times/Bloomberg: Do you think the government should

use taxpayers' dollars to rescue ailing private financial firms
whose collapse could have adverse effects on the economy
and market?
31% Yes, 55% No
Wash. Post/ABC News: Do you approve or disapprove of the

steps the Federal Reserve and the Treasury Department
have taken to try to deal with the current situation involving
the stock market and major financial institutions?
47% Approve, 42% Disapprove
535
Even if not slanted, a question's wording
may be confusing, and thus bias the
results.
1992: Does it seem possible or does it seem

impossible to you that the Nazi extermination
of the Jews never happened?
22% said possible.
1994: Does it seem possible to you that the
Nazi extermination of the Jews never
happened, or do you feel certain that it
happened?
1% said possible.
536
Who is asking the question can influence

the answer.
Two teams of interviewers, one white, one

black, surveyed Southern blacks during World
War II, asking if blacks would be treated better
or worse if Japan conquered the U.S.
Black interviewers: 9% better, 25% worse.
White interviewers: 3% better, 45% worse.
537
This desire to look good and please the

interviewer can affect the results no matter
who does the questioning.
A study on toothbrushing habits found that

if people brushed as much as they
claimed, toothpaste sales would be three
times higher than they actually were.
538
In December, 2012, Public Policy Polling found

that 39% of Americans had an opinion about the
Simpson-Bowles deficit reduction plan.
They also found that 25% had an opinion on the
Panetta-Burns plan, even though the latter didn't
exist!
In November, 2010, California voters

voted on Proposition 19 to legalize, tax,
and regulate recreational marijuana use.
As of July, 2010, 6 telephone polls had

been conducted:
539
3 polls with human interviewers showed the

proposition being defeated by 1, 2, and 4
percentage points.
3 automated polls (robopolls) showed the
proposition passing by 10, 14, and 16 points.
540
Dangers of Inference
Garbage In, Garbage Out
Statistical procedures don't check the data. If
there are issues with the data collection, there
will still be perfectly good-looking results from
the computer.
That's why checking the study design is so
critical.
541
Data snooping
If we observe an extreme result, and then

conduct a test, we should not be surprised
when that test returns a significant result.
Ex: A town of 50,000 has very high voltage
power lines. One year, the rate of a particular
type of cancer is 3 times the national
average.
A test of significance gives a p-value of
0.0002 = 1/5,000. Are the power lines
causing cancer?
542
If you split the U.S. population of more than
250,000,000 into sets of 50,000, there would
be more than 5,000 of them. By chance,
you'd expect at least one to have such a high
rate. Since the high rate led us to test, it's not
convincing yet.
If the high rate were to persist over several
years, it would suggest that something in the
town was causing it. (Remember, correlation
is not causation!)
Important studies should be replicated to be
truly convincing.
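The arithmetic behind the data-snooping argument can be made explicit. The sketch below treats the sets of 50,000 as independent, which is a simplifying assumption made here for illustration and not something the slides claim.

```python
population = 250_000_000
town_size = 50_000
p_value = 0.0002  # chance that one given town shows a rate this extreme

n_towns = population // town_size
expected_extreme = n_towns * p_value
# Probability that at least one of the towns looks this extreme by chance,
# ASSUMING the towns behave independently.
p_at_least_one = 1 - (1 - p_value) ** n_towns

print(f"{n_towns} towns, expected extreme towns = {expected_extreme:.1f}")
print(f"P(at least one by chance) = {p_at_least_one:.3f}")
```

So a town with a 1-in-5,000 cancer rate is roughly what chance alone would produce somewhere in the country, which is why testing after noticing the extreme value is unconvincing.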
543
Does the difference prove it?
Charles Tart wanted to prove ESP. He built a

device called the Aquarius which chose one
of 4 targets, which the subject was supposed
to predict.
Out of 7,500 guesses from 15 clairvoyant
subjects, 2,006 were hits. Compared to a null
of p=1/4, this gives a p-value of 0.0002.
Did this prove ESP?
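The quoted p-value can be approximated with a one-sided z-test on the hit count. The slides do not say how the p-value was computed, so the normal approximation and the one-sided alternative used below are assumptions.

```python
from math import sqrt
from statistics import NormalDist

n, hits, p0 = 7500, 2006, 0.25

expected_hits = n * p0
sd_hits = sqrt(n * p0 * (1 - p0))
z = (hits - expected_hits) / sd_hits
# Upper tail: probability of at least this many hits if p really is 1/4.
p_value = 1 - NormalDist().cdf(z)

print(f"expected hits = {expected_hits:.0f}, z = {z:.2f}, P = {p_value:.4f}")
```

The test only says that p = 1/4 is a poor description of the hit rate; it cannot say why.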
544
When checked, it came out that the random
number generator almost never picked the
same target twice in a row.
By choosing a different target for the next
guess than the one that had just lit up, a
subject could get almost a 1/3 chance of a hit.
A replication with an improved random number
generator showed no significant results.
The results of the first experiment weren't due
to chance with p=1/4, but they weren't due to
ESP either.
Statistical tests won't check your experimental
design.
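A small simulation shows where the 1/3 figure comes from. This sketch simplifies "almost never" to "never": the hypothetical generator below never repeats the previous target, and the simulated subject simply avoids guessing whatever lit up last. The function name and the seed are invented for illustration.

```python
import random

def simulate_hit_rate(n_trials, seed=0):
    """Estimate the hit rate of a subject who never guesses the previous
    target, when the generator never repeats a target (a simplification)."""
    rng = random.Random(seed)
    targets = [0, 1, 2, 3]
    previous = None
    hits = 0
    for _ in range(n_trials):
        # Both the machine and the subject rule out the previous target.
        allowed = [t for t in targets if t != previous]
        target = rng.choice(allowed)
        guess = rng.choice(allowed)
        hits += (guess == target)
        previous = target
    return hits / n_trials

hit_rate = simulate_hit_rate(100_000)
print(f"simulated hit rate: {hit_rate:.3f}")
```

With no ESP at all, the simulated subject scores close to 1/3 rather than 1/4, so a test against p=1/4 would come out highly significant.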
545
How much should we believe?
The opium value was unsourced in an
editorial. A Google search found a report on
The Afghan Opium Survey 2008 from the
United Nations Office on Drugs and Crime.
Methodology included use of satellite
imagery and surveys of farmers, villagers,
and traders.
546
Stomach stapling fatality rates are from
the International Bariatric Surgery
Registry. No information is given on how they
were computed.
College suicide rates are estimates by the
American Foundation for Suicide
Prevention. Again, no information is given on
how they were computed.
547