
Introduction to Statistics and Data Analysis
Angelo Maria Sabatini

Preliminary considerations
Statistics is about understanding the role that variability plays
in drawing conclusions based on data (numbers in context).







Data analysis process


The data analysis process can be viewed as a sequence of
steps that lead from planning to data collection to making
informed conclusions based on the resulting data:
understanding the nature of the problem
deciding what to measure and how to measure it
data collection
data summarization and preliminary analysis
formal data analysis
interpretation and communication of results


DEFINITION

Descriptive statistics
Branch of statistics that includes methods for organizing and
summarizing data.
Inferential statistics
Branch of statistics that involves generalizing from a sample to
the population from which the sample was selected and
assessing the reliability of such generalizations.

DEFINITION

The entire collection of individuals or objects about which information is desired is called the population of interest.
A sample is a subset of the population, selected for study.

Types of data and simple graphic displays


DEFINITION

A dataset consisting of observations on a single characteristic for each experimental unit is a univariate dataset.
A dataset consisting of observations on multiple characteristics for each experimental unit is a multivariate dataset.
A univariate dataset is categorical if the individual observations are categorical responses, e.g., sex, membership.
A univariate dataset is numerical if each observation is a number.
A multivariate dataset can describe each experimental unit with either categorical or numerical attributes.

Frequency distributions and bar charts


These are the preferred ways to present categorical data
(Relative) frequency distribution
Table that displays the possible categories along with the associated
frequencies and/or relative frequencies.
The frequency for a particular category is the number of times the
category appears in the dataset.
The relative frequency for a particular category is the proportion of
the observations that belong to that category.

Bar chart
Graph of the (relative) frequency distribution of categorical data. Each
category is represented by a bar or rectangle. The area of each bar is
proportional to the corresponding frequency or relative frequency.

Source
Motorcycle Helmet Use in 2005: Overall Results
National Highway Traffic Safety Administration, August 2005

Data were collected in June of 2005 by observing 1700 motorcyclists nationwide at selected roadway locations. Each time a motorcyclist passed by, the observer noted whether the rider was wearing no helmet, a noncompliant helmet, or a compliant helmet.
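As an illustrative sketch of how such a frequency table and bar chart could be produced in Python (the counts below are hypothetical, not the NHTSA figures):

```python
from collections import Counter
import matplotlib.pyplot as plt

# Hypothetical observations for the three helmet categories (not the NHTSA data):
observations = (["no helmet"] * 700 + ["noncompliant helmet"] * 200
                + ["compliant helmet"] * 800)

counts = Counter(observations)                 # frequency distribution
n = len(observations)
for category, freq in counts.items():
    print(f"{category:22s} frequency={freq:4d} relative frequency={freq / n:.3f}")

# Bar chart of the relative frequency distribution.
plt.bar(list(counts.keys()), [f / n for f in counts.values()])
plt.ylabel("relative frequency")
plt.show()
```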

Two types of numerical data


DEFINITION

A numerical variable results in discrete data (e.g., from counting) if the possible values of the variable correspond to isolated points on the number line.
A numerical variable results in continuous data if the set of possible values forms an entire interval on the number line.

Dot plots
Simple way to display numerical data when the dataset is
reasonably small.
Each observation is represented by a dot above the location
corresponding to its value on a horizontal measurement scale.
When a value occurs more than once, there is a dot for each
occurrence and these dots are stacked vertically.

What to look for in a dot plot?


The extent to which the data values spread out.
The nature of the distribution of values along the number line.
The presence of unusual values in the dataset.

data from each college were paired

Source
Keeping Score When It Counts: Graduation Rates and Academic Progress Rates for 2009 NCAA Men's Division I Basketball Tournament Teams
The Institute for Diversity and Ethics in Sport, University of Central Florida, March 2009

The graduation rates of basketball players were compared to those of all student
athletes for the universities and colleges that sent teams to the 2009 Division I playoffs.
The graduation rates represented the percentage of athletes who started college in 2002 who
had graduated by the end of 2008.

From the sample to the population


The objective of statistical inference is to draw conclusions about a
population using a sample from that population.
Observational study
The investigator observes characteristics of a representative sample
selected from one or more existing populations.
The goal is usually to draw conclusions about the corresponding
population or about differences between two or more populations.

Experiment
The investigator observes how a response variable behaves when one
or more explanatory variables, also called factors, are manipulated.
The composition of the groups exposed to different experimental
conditions is determined by random assignment.

Is it good health practice to buy a dog?

The American Heart Association recommends additional studies to determine if the improved heart rate variability is due to dog ownership or due to the fact that dog owners tend to get more exercise.

DEFINITION

A confounding variable is one that is related to both group membership and the response variable of interest in a research study.

Sampling
Procedure to select a representative sample of the population

[Diagram: a sample is drawn from the population by sampling; inference proceeds from the sample back to the population.]

There are many reasons for selecting a representative sample rather than obtaining information from the entire population (census), e.g., destructive measurements, time, money.

Bias in sampling is the tendency for samples to differ from the corresponding population in some systematic way.

Bias
Selection bias (undercoverage)
The way the sample is selected automatically excludes some parts of the population, e.g., interviewing landline phone users only.

Measurement (or response) bias
The method of observation introduces systematic errors in the measurement process of the attributes of interest, or, e.g., the questions during an interview are worded in a way that tends to influence the response.

Nonresponse bias
It occurs when responses are not obtained for all individuals selected
for inclusion in the sample; the nonresponse rate for surveys or
opinion polls can vary a lot, depending on how the data are collected.

Random sampling
Population
Set of objects (finite, countably infinite or infinite) whose attributes are to be investigated.
The attributes have values that can be modeled as random vectors with distribution F (known or unknown).

Simple random sample
Subset of the population with size n. Each sample of n individuals has the same probability of being chosen.
The attributes have values that can be considered independent, identically distributed (i.i.d.) random vectors with distribution F (known or unknown).

Replacement

DEFINITION

Sampling with replacement


The individuals are placed back in the population after being
selected.
Sampling without replacement
The individuals are not placed back in the population after being
selected.
It is not always feasible to replace, e.g., destructive
measurements.

The independence assumption is not valid when sampling without replacement.

When the sample size is small relative to the population size (say, less than 10% of it), the differences between the two sampling methods are small.

Other sampling methods


Stratified sampling
The population is divided into a series of non-overlapping groups (strata). Separate simple random samples are selected from each stratum.
The advantage is that it is easier to make more accurate inferences about a population when working on relatively homogeneous subgroups.

Simple comparative experiments


DEFINITION

Explanatory variables (factors)
Variables whose values are controlled by the experimenter.

Response variables
Variables that are measured as part of the experiment.

Experimental conditions (treatments)
Any particular combination of values of the explanatory variables.

Extraneous variables
Variables that, although not included in the set of explanatory variables, are thought to affect the response variable.

Simple comparative experiments


What happens if...? What is the effect of...?
We are interested in assessing the effects of some explanatory variable on the response variable. For example, a medical researcher may want to determine how a proposed treatment for a disease compares to a standard treatment.

The problem of extraneous variables

Deal with the potential effects of extraneous variables by using random assignment to experimental conditions and sometimes also by incorporating direct control and/or blocking into the design of the experiment.

Simple comparative experiments


Approach
By direct control and/or blocking, we want to prevent the situation where the explanatory variable(s) and the extraneous variable(s) are confounded, i.e., their effects on the response variable cannot be distinguished from one another.

Direct control
A researcher can directly control some extraneous variables; then any observed differences between groups could not be explained by them.

Blocking
The effects of some extraneous variables can be filtered out by a process known as blocking. Blocking creates groups (called blocks) that are similar with respect to blocking variables; then all treatments are tried in each block.

Simple comparative experiments


Random assignment
Strategy to deal with extraneous variables that cannot be directly controlled or are difficult to use as blocking variables.
Random assignment ensures that the experiment does not systematically favor one experimental condition over any other and promotes the creation of homogeneous experimental groups.

Replication
Replication is the design strategy of making multiple observations for each experimental condition.
When an experiment can be viewed as a sequence of trials, random assignment involves the random assignment of treatments to trials.
Random assignment, either of subjects to treatments or of treatments to trials, is a critical component of a good experiment.

The Hardness Testing Example


Introduction to
Hardness Testing
Hardness has a variety of meanings. To the metals industry, it may be
thought of as resistance to permanent deformation. To the metallurgist, it
means resistance to penetration. To the lubrication engineer, it means resistance to wear. To the design engineer, it is a measure of flow stress. To the
mineralogist, it means resistance to scratching, and to the machinist, it
means resistance to machining. Hardness may also be referred to as mean
contact pressure. All of these characteristics are related to the plastic flow
stress of materials.

Measuring Hardness
Hardness is indicated in a variety of ways, as indicated by the names of the
tests that follow:
Static indentation tests: A ball, cone, or pyramid is forced into the surface of the metal being tested. The relationship of load to the area or
depth of indentation is the measure of hardness, such as in Brinell,
Knoop, Rockwell, and Vickers hardness tests.
Rebound tests: An object of standard mass and dimensions is bounced
from the surface of the workpiece being tested, and the height of rebound
is the measure of hardness. The Scleroscope and Leeb tests are examples.
Scratch file tests: The idea is that one material is capable of scratching
another. The Mohs and file hardness tests are examples of this type.
Plowing tests: A blunt element (usually diamond) is moved across the
surface of the workpiece being tested under controlled conditions of load



The Hardness Testing Example

To determine whether 4 different tips (the treatment factor) produce different (mean) hardness readings on a hardness tester; the treatment factor is the design of the tip for the machine that determines the hardness of metal; the tip is one component of the testing machine.

The tips are assigned to an experimental unit, that is, to a test specimen (called a coupon), which is a piece of metal on which the tip is tested.

Completely randomized experiment
Assign the tips to a random piece of metal for each test.
The test specimens would be considered a source of nuisance variability.

Blocked experiment
Assign all four tips to the same test specimen, randomly assigned to be tested on a different location on the specimen; since each treatment occurs once in each block, the number of test specimens is the number of replicates.

The Hardness Testing Example

In this experiment, each specimen is called a block; thus, we have designed a more homogeneous set of experimental units on which to test the tips.

Variability between blocks can be large, whereas variability within a block should be relatively small.

We are interested in testing the equality of treatment means, but now we have the ability to remove the variability associated with the nuisance factor (the blocks) through the grouping of the experimental units prior to having assigned the treatments.

Unknown nuisance variables


Suppose the tensile strength testing machine exhibits a warm-up effect such that the longer it is on, the lower the observed tensile strength readings will be.

The warm-up effect will potentially contaminate the tensile strength data and destroy the validity of the experiment.

Comparative bar charts and pie charts


Other ways to display categorical data (and also numerical data)
Comparative bar charts
They provide a visual comparison of two or more groups, i.e., bar charts using the same set of horizontal and vertical axes.
Use the relative frequency, rather than the frequency, to construct the scale on the vertical axis (fair comparisons are allowed even when the groups do not have the same size).

Pie charts
Useful to display data for a relatively small number of possible categories. They illustrate proportions of the whole dataset for various categories. A variant of the pie chart is the segmented bar graph (aka stacked bar graph), which uses a rectangular bar rather than a circle.

Histograms to display numerical data


How to construct
Divide the horizontal axis into a number of intervals, or bins, spanning the whole data range (from the min to the max value in the dataset).
Choose a suitable width for each bin (all bins have equal width).
Count the number of observations falling in each bin, and represent the counts on the vertical scale (frequency or relative frequency).

What to look for


Center or typical value
Extent of spread, variability
General shape (approximation to the probability density function)
Location and number of peaks (distribution modes)
Presence of gaps and outliers.

Histograms to display numerical data


Rules of thumb
How to choose the number of bins K relative to the dataset size N?
Practical rule: $K \approx \sqrt{N}$
It may be difficult to approximate accurately the tail behavior of the distribution (especially for heavy-tailed distributions).
Sometimes, a useful option is to have bins with varying widths (wider in the distribution tails).
Never reduce the bin width below the measurement resolution.
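A minimal sketch of histogram construction with the K ≈ √N rule (NumPy and Matplotlib assumed; the data are simulated, not from the slides):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=23.0, scale=5.0, size=400)    # illustrative dataset, N = 400

K = int(round(np.sqrt(data.size)))                  # practical rule: K ~ sqrt(N)
plt.hist(data, bins=K, density=True)                # density = relative frequency / bin width
plt.xlabel("value")
plt.ylabel("density")
plt.show()
```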

[Figure: histograms of the same dataset drawn with 10, 20, and 50 bins.]

$\text{density} = \frac{\text{relative frequency}}{\text{bin width}}$

The use of the density to construct the histogram points to its nature as an approximant of the underlying probability density function.

Histogram shapes
General shape
Sometimes emphasized by using a smoothed histogram, i.e., a
smooth curve approximating the histogram itself.

DEFINITION

Unimodal histogram
It has a single peak, (a).
Bimodal histogram
It has two peaks, (b).
Multimodal histogram
It has more than two peaks, (c).

Histogram shapes
Tails and skewness
Proceeding to the right (left) from the peak of a unimodal histogram, we move into the upper (lower) tail of the histogram.

DEFINITION

A unimodal histogram that is not symmetric is said to be skewed.
A skewed unimodal histogram is right skewed (left skewed) if the upper (lower) tail is much longer than the lower (upper) tail.

Right skewness is much more frequently found than left skewness.

Displaying bivariate numerical data


Scatterplot
A bivariate dataset consists of measurements or observations of
two variables. Each observation (pair of numbers) is
represented by a point on a rectangular coordinate system.

Displaying univariate numerical data


Time series plot
Used when the observations are collected at regular intervals of
time. A variant of scatterplot, where y is the variable observed
and x is the time at which the observations were made.
Consecutive observations are connected by a line segment.

Communicating the results


Some things to remember
Select a display that is appropriate for the given type of data.
Include scales and labels on the axes of graphical displays.
In comparative plots, include labels or a legend so that it is clear
which parts of the display correspond to which samples or groups in
the data set.
In graphical displays, areas should be proportional to frequency, relative frequency, or magnitude of the number being represented.
Potential problems with:
Graphs having broken axes.
Unequal time spacing in time series plots.
Patterns in scatterplots (no cause-and-effect relationship between the two variables should be implied).

Describing the center of a dataset


Searching for a value representative of the observations
Mean
It is the familiar arithmetic average: the sum of the observations divided by the number of observations.

$y_1, y_2, \ldots, y_n \qquad \bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n} = \frac{1}{n}\sum_{i=1}^{n} y_i$

Example: $\bar{y} = 23.10$

One potential drawback to the mean as a measure of center for a dataset is that its value can be greatly affected by the presence of even a single outlier (an unusually large or small observation) in the dataset.

Describing the center of a dataset


Searching for a value representative of the observations
Median
Once the data values have been listed in order from smallest to largest, the median is the middle value in the list.
n odd: the sample median is the single middle value.
n even: the sample median is the average of the two middle values.

Example: $\bar{y} = 23.10$, median = 13

The median is quite insensitive to outliers.

Describing the center of a dataset


Searching for a value representative of the observations
Trimmed mean
It is computed by first ordering the data values from smallest to largest, deleting a selected number of values from each end of the ordered list, and finally averaging the remaining values.
The trimming percentage is the percentage of values deleted from each end of the ordered list.
The trimmed mean is a compromise between the mean and the median.
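A small sketch of the three measures of center on an illustrative dataset (standard library only; trimmed_mean is a hypothetical helper, not a library function):

```python
import statistics

data = [4, 7, 9, 11, 12, 13, 14, 15, 18, 22, 160]   # illustrative data; 160 is an outlier

def trimmed_mean(values, trim_pct):
    """Mean after deleting trim_pct percent of values from each end of the sorted list."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_pct / 100)           # number of values removed per end
    return statistics.mean(ordered[k:len(ordered) - k])

print("mean       :", statistics.mean(data))        # pulled upward by the outlier
print("median     :", statistics.median(data))      # insensitive to the outlier
print("10% trimmed:", trimmed_mean(data, 10))       # compromise between the two
```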

Describing the center of a dataset


The median is the value on the x-axis that separates the smoothed histogram, or the probability density function, into two parts, each with 0.5 of the area under the curve.

While the median and the mean have the same value for symmetric distributions, the mean lies above (below) the median for right (left) skewed distributions.

Describing variability in a dataset


It is important to describe how much the observations differ from one another.
Range
Defined as the difference between the largest and smallest observations.
Variance and standard deviation
The sample variance is defined as the sum of the squared deviations from the sample mean divided by n - 1. The positive square root of the sample variance is the sample standard deviation.

$y_1, y_2, \ldots, y_n \qquad \bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$

$s^2 = \frac{(y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2}{n - 1}$

Describing variability in a dataset


It is important to describe how much the observations differ from
one another.
Interquartile range, IQR
A measure of variability that is resistant to the presence of outlying
observations.

DEFINITION

Lower quartile, Q25
Median of the lower half of the sample.
Upper quartile, Q75
Median of the upper half of the sample.

$IQR = Q_{75} - Q_{25}$
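A possible sketch of the quartile, IQR, and outlier-flagging computations with NumPy (illustrative data):

```python
import numpy as np

data = np.array([4, 7, 9, 11, 12, 13, 14, 15, 18, 22, 160])  # illustrative data

q25, q50, q75 = np.percentile(data, [25, 50, 75])
iqr = q75 - q25
print(f"Q25={q25}, median={q50}, Q75={q75}, IQR={iqr}")

# Observations farther than 1.5*IQR from the box are flagged as outliers (boxplot rule).
outliers = data[(data < q25 - 1.5 * iqr) | (data > q75 + 1.5 * iqr)]
print("outliers:", outliers)
```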

Box and whisker plots

[Figure: box and whisker plot. The box extends from the lower quartile Q25 to the upper quartile Q75 (height IQR), with a line at the median Q50; the whiskers reach the most extreme observations ymin and ymax; observations farther than 1.5 IQR from the box are marked as outliers.]


Interpreting center and variability


The mean and the standard deviation can be combined to make
informative statements about how the values in a dataset are
distributed and about the relative position of a particular value in a
dataset.

The empirical rule


If the histogram of values in a data set can be reasonably well
approximated by a normal curve, then:
Approximately 68% of the observations are within 1 standard
deviation of the mean.
Approximately 95% of the observations are within 2 standard
deviations of the mean.
Approximately 99.7% of the observations are within 3
standard deviations of the mean.

Measures of relative standing



DEFINITION

Z-score
The Z-score corresponding to a particular value is:

$z\text{-score} = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$

It tells how many standard deviations the value is from the mean.
The process of subtracting the mean and then dividing by the standard deviation is referred to as standardization.
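A minimal standardization sketch (NumPy assumed; the values are illustrative):

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # illustrative data
mean, sd = data.mean(), data.std(ddof=1)                     # sample mean and std (n-1 divisor)

z_scores = (data - mean) / sd    # how many standard deviations each value is from the mean
print(z_scores)
```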

Measures of relative standing



DEFINITION

Percentile
For any particular number r between 0 and 100, the r-th
percentile is a value such that r percent of the observations in the
dataset fall at or below that value.

Communicating the results


Some things to remember
Data distributions with different shapes can have the same
mean and standard deviation.
Both the mean and the standard deviation are sensitive to
extreme values in the dataset, especially if the sample size is
small.
Not all distributions are normal or approximately normal.
Problems of outlying observations in the dataset.
Potential problems with box and whisker plots based on small
sample sizes.

Summarizing bivariate data


We need methods for describing relationships between two numerical variables and for assessing the strength of a relationship.
Correlation coefficients
To assess the strength of relationship between the x and y values in a bivariate data set consisting of (x, y) pairs.

Example of scatterplots:
(a)-(b) positive linear relationships;
(c) negative linear relationship;
(d) no relationship;
(e) nonlinear relationship.

Pearson's sample correlation coefficient

To measure the strength of any linear relationship between two numerical variables.
It does this by using Z-scores:

$z_x = \frac{x - \bar{x}}{s_x} \qquad z_y = \frac{y - \bar{y}}{s_y}$

$r = \frac{1}{n-1}\sum_{i=1}^{n} z_{x_i}\, z_{y_i}$

Properties
1. The r-value does not depend on the unit of measurement for either variable.
2. The r-value is between -1 and +1. A value near the upper (lower) limit indicates a strong positive (negative) relationship.
3. The upper (lower) limit occurs when all the points in a scatterplot of the data lie on a straight line with positive (negative) slope.
4. The r-value is a measure of the extent to which x and y are linearly related.
5. An r-value close to 0 does not rule out any strong relationship between x and y; there could still be a strong relationship that is not linear.
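A sketch of the z-score formula for r, checked against NumPy's built-in correlation (illustrative data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])      # illustrative (x, y) pairs
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])

zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
r = (zx * zy).sum() / (len(x) - 1)                 # r = (1/(n-1)) * sum of z_x * z_y

print(r)
print(np.corrcoef(x, y)[0, 1])                     # same value from the built-in routine
```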

Correlation and causation


Association does not imply causation
An r-value close to 1 indicates that the larger values of one variable
tend to be associated with the larger values of the other variable.
It frequently happens that two variables are highly correlated not
because one is causally related to the other but because they are both
strongly related to a third variable.

Proving causality is an elusive task


Scientific experiments can frequently make a strong case for causality by carefully controlling the values of all variables that might be related to the ones under study.
In the absence of such control and ability to manipulate values of one variable, the possibility exists that an unidentified underlying third variable is influencing both the variables under investigation.

Linear regression: fitting a line to bivariate data



The principle of least squares


Astringency is the characteristic of a wine that makes the wine drinker's mouth feel dry and puckery.
Tannins are chemical compounds that are found in the bark and fruits of some plants.

Pearson's correlation coefficient: r = 0.912

The principle of least squares


Data: $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$.

Fit the line $\hat{y} = b\,x + a$, where

$(a, b) = \underset{a,b}{\operatorname{argmin}} \sum_{i=1}^{n} (y_i - b\,x_i - a)^2$

$b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \qquad a = \bar{y} - b\,\bar{x}$
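A sketch of these least-squares formulas on illustrative data (not the wine dataset); np.polyfit solves the same minimization and should agree:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])       # illustrative data
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])

b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
a = y.mean() - b * x.mean()
print(f"least-squares line: y_hat = {b:.3f} x + {a:.3f}")

print(np.polyfit(x, y, 1))                          # same slope and intercept
```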

Regression
$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = r\,(n-1)\,s_x s_y \qquad \sum_{i=1}^{n} (x_i - \bar{x})^2 = (n-1)\,s_x^2$

$b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = r\,\frac{s_y}{s_x}$

$\hat{y} = \bar{y} + r\,\frac{s_y}{s_x}(x - \bar{x})$

$x = \bar{x} \pm t\,s_x \;\Rightarrow\; \hat{y} = \bar{y} \pm r\,t\,s_y$

Consider using the least-squares line to predict the value of y associated with an x value some specified number of standard deviations away from $\bar{x}$. Then the predicted y value will be only r times this number of standard deviations from $\bar{y}$. In terms of standard deviations, except when $|r| = 1$, the predicted y will always be closer to $\bar{y}$ than x is to $\bar{x}$.

Assessing the fit of a line

Important questions to consider:
Is a line an appropriate way to summarize the relationship between the two variables?
Are there any unusual aspects of the dataset that we need to consider before proceeding to use the regression line to make predictions?
If we decide that it is reasonable to use the regression line as a basis for prediction, how accurate can we expect predictions based on the regression line to be?

Assessing the fit of a line

Data: $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$; fitted line $\hat{y} = b\,x + a$.

$\hat{y}_i = b\,x_i + a \qquad e_i = y_i - \hat{y}_i \quad \text{(residuals)}$
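Continuing the sketch above, the residuals and a residual plot could be obtained as follows (the x, y, a, b names are assumed from the previous snippet):

```python
import matplotlib.pyplot as plt

y_hat = b * x + a            # fitted values from the least-squares line
residuals = y - y_hat        # e_i = y_i - y_hat_i

plt.scatter(x, residuals)
plt.axhline(0.0, linestyle="--")
plt.xlabel("x")
plt.ylabel("residual")
plt.show()                   # curvature or patterns here suggest a line is a poor summary
```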

Plotting the residuals

A careful look at residuals can reveal many potential problems.
A desirable residual plot is one that exhibits no particular pattern, such as curvature.
Curvature in the residual plot is an indication that the relationship between x and y is not linear and that a curve would be a better choice than a line for describing the relationship between x and y.

Plotting the residuals


Dataset: x = height (in inches) and y = weight (in pounds) for American females, age 30 to 39.

The scatterplot displayed appears rather straight. However, when the residuals from the least-squares line are plotted, substantial curvature is apparent (even though r = 0.99). It is not accurate to say that weight increases in direct proportion to height (linearly with height).

Plotting the residuals


Look for unusual values in the scatterplot or in the residual plot.
Large residuals may indicate some type of unusual behavior, such as a recording error, a nonstandard experimental condition, or an atypical experimental subject.
A point whose x value differs greatly from others in the data set may have exerted excessive influence in determining the fitted line (although it does not necessarily represent an outlier).
One method for assessing the impact of points on the fit is to delete them from the data set, compute the best-fit line again, and evaluate the extent to which the equation of the line has changed.

Plotting the residuals


DEFINITION

An observation is potentially an influential observation if it has an x value that is separated from the rest of the data in the x direction.

An observation is an outlier if it has a large residual. Outlier observations fall far away from the least-squares line in the y direction.















Coefficient of determination
To measure the proportion of variability in the y variable that can be explained by a linear relationship between x and y.

DEFINITION

The coefficient of determination, denoted by r^2, gives the proportion of variation in y that can be attributed to an approximate linear relationship between x and y.

DEFINITION

Total sum of squares: $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$

Residual sum of squares: $SSRes = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

$r^2 = 1 - \frac{SSRes}{SST}$

Standard Deviation About the LS-Line

The coefficient of determination measures the extent of variation about the best-fit line relative to overall variation in y.
A high value of r^2 does not by itself promise that the deviations from the line are small in an absolute sense.

DEFINITION

The standard deviation about the least-squares line* is given by:

$s_e = \sqrt{\frac{SSRes}{n - 2}}$

Reminder: use both r^2 and s_e to assess the quality of a line fit.

* Later, we will understand the reason for dividing by n - 2.
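A sketch that ties these quantities together, reusing the y and y_hat arrays assumed in the earlier snippets:

```python
import numpy as np

sst = ((y - y.mean()) ** 2).sum()        # total sum of squares
ss_res = ((y - y_hat) ** 2).sum()        # residual sum of squares

r2 = 1.0 - ss_res / sst                  # coefficient of determination
se = np.sqrt(ss_res / (len(y) - 2))      # standard deviation about the LS line

print(f"r^2 = {r2:.3f}, se = {se:.3f}")
```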

Probability
Use the ideas and methods of the theory of probability to tackle, in a
systematic way, the study of uncertainty.

DEFINITION

Chance experiment
Experiment whose result is uncertain before it is performed (it cannot be predicted with certainty). It can be repeated many times under the same conditions. Each single performance is called a trial.






DEFINITION

Sample space
The collection of all possible outcomes of a chance experiment.



Probability
DEFINITION

Events
Any collection of outcomes from the sample space of a chance experiment.
Simple event
Event consisting of exactly one outcome.

Example (a die is thrown twice): the simple events are all the ordered pairs
(1,1), (1,2), ..., (6,5), (6,6)
Event consisting of all outcomes in which the first throw yields 1:
(1,1), (1,2), ..., (1,5), (1,6)

Use of Venn diagrams

DEFINITION

Let A and B denote two events.

1. The event not A consists of all experimental outcomes that are not in event A. Not A is sometimes called the complement of A and is usually denoted by $A^C$ or $\bar{A}$.
2. The event A or B consists of all experimental outcomes that are in at least one of the two events. A or B is called the union of the two events and is denoted by $A \cup B$.
3. The event A and B consists of all experimental outcomes that are in both of the events A and B. A and B is called the intersection of the two events and is denoted by $A \cap B$.

Use of Venn diagrams

DEFINITION

Two events are disjoint, or mutually exclusive, if they do not have outcomes in common.
Their intersection is the empty set, often indicated with the symbol $\emptyset$.

Probability as a numerical measure of uncertainty


Considerations of common sense about probabilities:
They are numerical quantities, defined on sets of outcomes.
They take nonnegative values.
They are additive over disjoint outcomes.
They sum to 1 over all possible disjoint outcomes.

Arguments for a formal definition
Symmetry, or exchangeability argument
Frequency argument

Probability as a numerical measure of uncertainty


Symmetry argument

$\text{Probability of } E = \frac{\text{Number of outcomes favorable to } E}{\text{Number of outcomes in the sample space}}$

Common understanding (e.g., coin tossing):
Equally likely possibilities are assumed (conceptual difficulty)

Frequency argument

$\text{Probability of } E = \frac{\text{Number of times } E \text{ occurs}}{\text{Number of trials}}$

Common understanding (e.g., coin tossing):
An infinite number of tosses are performed in an identical manner, physically independent of each other (experimental difficulties)

Weak law of large numbers


As the number of repetitions of a chance experiment increases, the chance that the relative frequency of occurrence for an event will differ from the true probability of the event by more than any small number approaches 0.

As the number of tosses increases, the relative frequency of heads does not continue to fluctuate wildly but instead stabilizes and approaches some limiting value.
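A quick simulation consistent with this statement (a fair coin simulated with NumPy; seed and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
tosses = rng.integers(0, 2, size=100_000)            # 1 = heads, 0 = tails, fair coin

running_freq = np.cumsum(tosses) / np.arange(1, tosses.size + 1)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"after {n:6d} tosses: relative frequency of heads = {running_freq[n - 1]:.4f}")
# The printed frequencies fluctuate at first and then stabilize near 0.5.
```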

Axiomatic definition of probability (in pills)

Axiomatic definition
sample space, X
collection of subsets of X, F
probability measure on (X, F): $\Pr : F \to [0, 1]$

Axioms
(1) $\Pr(\emptyset) = 0$
(2) $\Pr(X) = 1$
(3) $\Pr(A_1 \cup A_2) = \Pr(A_1) + \Pr(A_2)$, if $A_1 \cap A_2 = \emptyset$

Axiom 3 holds for any countable collection of disjoint sets.

Some useful results

$\Pr(\bar{A}) = 1 - \Pr(A)$
$\Pr(A \cup B) = \Pr(A) + \Pr(\bar{A} \cap B)$
$\Pr(B) = \Pr(A \cap B) + \Pr(\bar{A} \cap B)$
$\Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B)$
$\Pr(A \cup B) \le \Pr(A) + \Pr(B)$
$A \subseteq B \;\Rightarrow\; \Pr(A) \le \Pr(B)$

[Venn diagrams illustrating these identities accompany the results in the slides.]

Conditional probability
The probability of an event A under the condition that the event B has occurred is called the conditional probability of A given B (or probability of A conditional on B), defined as:

$\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)}, \quad \text{if } \Pr(B) \neq 0$

Relative frequency interpretation

$f_N(A \mid B) = \frac{N_{A \cap B}}{N_B} = \frac{N_{A \cap B}/N}{N_B/N} = \frac{f_N(A \cap B)}{f_N(B)}$

Example
Events (a card is drawn from a standard 52-card deck)
A: {aces (of hearts, diamonds, clubs, spades)}
B: {hearts}

Probability of an ace, given that the drawn card is a heart:

$\Pr(A \mid B) = \frac{\text{favorable cases (ace of hearts)}}{\text{possibilities (hearts)}} = \frac{1/52}{13/52} = \frac{1}{13} = \frac{\Pr(A \cap B)}{\Pr(B)}$

Probability that the drawn ace is a heart:

$\Pr(B \mid A) = \frac{\text{favorable cases (ace of hearts)}}{\text{possibilities (aces)}} = \frac{1/52}{4/52} = \frac{1}{4} = \frac{\Pr(A \cap B)}{\Pr(A)}$

Law of total probability

{B_n: n = 1, 2, 3, ...} is a finite or countable partition of a probability space and each set B_n is measurable.

Partition: $\bigcup_{i=1}^{n} B_i = X, \quad B_i \cap B_j = \emptyset, \; i \neq j$

For any event A: $\Pr(A) = \sum_{n} \Pr(A \cap B_n) = \sum_{n} \Pr(A \mid B_n)\,\Pr(B_n)$

[Figure: a sample space partitioned into regions B1, ..., B6.]

Independent events
Two events A and B are said to be independent if the probability of occurrence of one event is not affected by the occurrence of the other event:

$\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)} = \Pr(A) \qquad \Pr(B \mid A) = \frac{\Pr(A \cap B)}{\Pr(A)} = \Pr(B)$

$\Pr(A \cap B) = \Pr(A)\,\Pr(B)$

The fact that the knowledge of A (B) does not change the state of knowledge about B (A) has been expressed in terms of conditional probabilities.

Independent events
If A depends on B, then B depends on A:

$\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)} \neq \Pr(A) \;\Rightarrow\; \Pr(B \mid A) = \frac{\Pr(A \cap B)}{\Pr(A)} \neq \Pr(B)$

Do not confuse causality and dependence
Dependence is different from causality, where A causes B, but B does not cause A.

Independence simplifies the computation of joint probabilities:

joint probability: $\Pr(A \cap B) = \Pr(A)\,\Pr(B)$ (product of probabilities, if independent)

Sampling with and without replacement

Events (52 cards)
H1 = {the 1st card selected is hearts}
H2 = {the 2nd card selected is hearts}
H3 = {the 3rd card selected is hearts}

Sampling with replacement
Replacing selected cards gives the same deck for each selection: $\Pr(H_3) = 0.25$

Sampling without replacement

$\Pr(H_3 \mid H_1 \cap H_2) = \frac{\text{Number of outcomes favorable to } H_3}{\text{Number of outcomes in the sample space}} = \frac{11}{50} = 0.22$

$\Pr(H_3 \mid \bar{H}_1 \cap \bar{H}_2) = \frac{\text{Number of outcomes favorable to } H_3}{\text{Number of outcomes in the sample space}} = \frac{13}{50} = 0.26$

Sampling with and without replacement

Events (520 cards)
H1 = {the 1st card selected is hearts}
H2 = {the 2nd card selected is hearts}
H3 = {the 3rd card selected is hearts}

Sampling with replacement
Replacing selected cards gives the same deck for each selection: $\Pr(H_3) = 0.25$

Sampling without replacement

$\Pr(H_3 \mid H_1 \cap H_2) = \frac{128}{518} = 0.247 \qquad \Pr(H_3 \mid \bar{H}_1 \cap \bar{H}_2) = \frac{130}{518} = 0.251$

Bayes rule
Since

$\Pr(A \mid B)\,\Pr(B) = \Pr(A \cap B) = \Pr(B \mid A)\,\Pr(A)$

we have the Bayes rule:

$\Pr(A \mid B) = \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)}$

For any partition of the sample space {A_i, i = 1, 2, ..., n}:

$\Pr(A_i \mid B) = \frac{\Pr(B \mid A_i)\,\Pr(A_i)}{\sum_{j=1}^{n} \Pr(B \mid A_j)\,\Pr(A_j)}$

(the denominator follows from the law of total probability)


Bayes rule
A priori probability
Probability of event A_i, without knowing that event B has occurred: $\Pr(A_i)$

A posteriori probability
Probability of event A_i, knowing that event B has occurred:

$\Pr(A_i \mid B) = \frac{\Pr(B \mid A_i)\,\Pr(A_i)}{\sum_{j=1}^{n} \Pr(B \mid A_j)\,\Pr(A_j)}$

Intuitively, the Bayes rule can be used to find the probability of a particular cause (event A_i) among all the possible causes (events A_1, A_2, ..., A_n) from the effect (event B).
Finite probability space

Sample space: $A = \{a_1, a_2, \ldots, a_N\}$
$F_A$ is the set of all subsets formed with the elements of A.

Singleton probabilities
Specify the probability of the singletons (sets with exactly one element):

$p(a_i) = \Pr(\{a_i\}), \quad i = 1, \ldots, N$

The probability of any event in $F_A$ is determined via Axiom 3:

$B = \{a_{k_1}, a_{k_2}, \ldots, a_{k_M}\}, \quad 1 \le M \le N$

$\Pr(B) = \Pr\!\left(\bigcup_{i=1}^{M} \{a_{k_i}\}\right) = \sum_{i=1}^{M} p(a_{k_i})$

Product of finite probability spaces

Probability space A: $A = \{a_1, a_2, \ldots, a_M\}$, $p_A = \{\Pr(a_i),\; i = 1, \ldots, M\}$
Probability space B: $B = \{b_1, b_2, \ldots, b_N\}$, $p_B = \{\Pr(b_j),\; j = 1, \ldots, N\}$

Joint probabilities
$p_{AB} = \{\Pr(a_i, b_j),\; i = 1, \ldots, M;\; j = 1, \ldots, N\}$

Marginal probabilities
$p_A$ and $p_B$ as above.

Conditional probabilities
$p_{A|B} = \{\Pr(a_i \mid b_j),\; i = 1, \ldots, M;\; j = 1, \ldots, N\}$
$p_{B|A} = \{\Pr(b_j \mid a_i),\; j = 1, \ldots, N;\; i = 1, \ldots, M\}$

Marginalization

$p_A = [\Pr(a_1), \ldots, \Pr(a_M)] \qquad p_B = [\Pr(b_1), \ldots, \Pr(b_N)] \qquad p_{AB} = [\Pr(a_i, b_j)]_{i=1,\ldots,M;\; j=1,\ldots,N}$

$\Pr(a_i) = \sum_{j=1}^{N} \Pr(a_i, b_j)$

Law of total probability

$\Pr(b_j) = \sum_{i=1}^{M} \Pr(b_j \mid a_i)\,\Pr(a_i)$

Definition

$p_{B|A}(a_i) = [\Pr(b_1 \mid a_i), \ldots, \Pr(b_N \mid a_i)] \qquad \Pr(b_j \mid a_i) = \frac{\Pr(a_i, b_j)}{\Pr(a_i)}$

Numerical transmission system

[Diagram: Source -> Transmitter -> Transmission channel (with noise) -> Receiver -> User]

Transmitted symbols: $a_1, \ldots, a_M$ with probabilities $\Pr(a_1), \ldots, \Pr(a_M)$
Received symbols: $b_1, \ldots, b_N$ with probabilities $\Pr(b_1), \ldots, \Pr(b_N)$

Channel conditional probabilities:

$p_{B|A} = [\Pr(b_j \mid a_i)]_{i=1,\ldots,M;\; j=1,\ldots,N}$

Probability of correct reception and probability of error:

$\Pr(C) = \sum_{i} \Pr(S_R = a_i \mid S_T = a_i)\,\Pr(S_T = a_i) \qquad \Pr(E) = 1 - \Pr(C)$

Numerical transmission system

Binary source: symbols S0, S1; received symbols R0, R1.

Source: $\Pr(S_1) = 0.118, \quad \Pr(S_0) = 0.882$
System design: $\Pr(R_1 \mid S_0) = 0.1, \quad \Pr(R_0 \mid S_1) = 0.05$

Observation:

$\Pr(R_0) = \Pr(R_0 \mid S_0)\Pr(S_0) + \Pr(R_0 \mid S_1)\Pr(S_1) = (1 - \Pr(R_1 \mid S_0))(1 - \Pr(S_1)) + \Pr(R_0 \mid S_1)\Pr(S_1) = 0.8$

$\Pr(S_1 \mid R_1) = \frac{\Pr(R_1 \mid S_1)\Pr(S_1)}{\Pr(R_1)} = \frac{(1 - \Pr(R_0 \mid S_1))\Pr(S_1)}{1 - \Pr(R_0)} = 0.559 \qquad \Pr(S_0 \mid R_0) = 0.993$

Performance: $\Pr(E) = 0.09$
Probability and statistics

[Diagram: PROBABILITY proceeds by deduction from the mathematical model of the population (population, sample space, event, trial, outcome) toward the sample; STATISTICS proceeds by inductive inference (experiment, measurement, hypothesis testing) from the sample back to the population.]

Probability and statistics


Probability
Deductive analysis of stochastic phenomena using
probabilistic mathematical models

Statistics
Inductive reasoning, from the experimental data to the
probabilistic mathematical models that are more suited to
describe the stochastic phenomena of interest



Estimating probabilities empirically


Use observed long-run proportions to estimate probabilities:

Observe a very large number of outcomes under controlled circumstances (trials in a repeated chance experiment).

Estimate the probability of an event to be the observed proportion of occurrence, based on the interpretation of probability as a long-run relative frequency and on the weak law of large numbers.

Weak law of large numbers


As the number of repetitions of a chance experiment increases, the
chance that the relative frequency of occurrence for an event will
differ from the true probability of the event by more than any small
number approaches 0.

Random variables
DEFINITION
Random variable
Numerical variable whose value depends on the outcome of a
chance experiment. A random variable associates a numerical value
with each outcome of a chance experiment.

A discrete random variable can assume only values in a collection of
isolated points along the number line, typically obtained by
counting.

A continuous random variable has possible values in an entire
interval of the number line, typically obtained by measuring
uncertain physical variables.

Probability distribution (discrete RV)

DEFINITION
It can be specified using the probability mass function (PMF):

$\text{PMF}: \; p_i = \Pr(X = x_i), \quad i = 1, \ldots, N$

Probability distribution (continuous RV)

DEFINITION
It can be specified using different descriptors, including:
the cumulative distribution function (CDF)
the probability density function (PDF)

$\text{CDF}: \; F_X(x) = \Pr(X \le x) \qquad \text{PDF}: \; p_X(x) = \frac{d}{dx}F_X(x)$

$0 \le F_X(x) \le 1 \qquad F_X(-\infty) = 0 \qquad F_X(+\infty) = 1 \qquad F_X(x_1) \le F_X(x_2) \text{ for } x_1 < x_2$

$p_X(x) \ge 0 \qquad \int_{-\infty}^{+\infty} p_X(x)\,dx = 1 \qquad F_X(x) = \int_{-\infty}^{x} p_X(v)\,dv$

$\Pr(x_1 < X \le x_2) = F_X(x_2) - F_X(x_1) = \int_{x_1}^{x_2} p_X(x)\,dx$

Expected value of a random variable

DEFINITION
Discrete RV: $E[X] = \sum_{i=1}^{N} x_i\,p_i \qquad$ Continuous RV: $E[X] = \int_{-\infty}^{+\infty} x\,p_X(x)\,dx$

Fundamental theorem of the expectation

$E[g(X)] = \sum_{i} g(x_i)\,p_i \qquad E[g(X)] = \int_{-\infty}^{+\infty} g(x)\,p_X(x)\,dx$

Linearity of the expectation operator

$E[g(X) + h(X)] = E[g(X)] + E[h(X)]$
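A discrete-RV sketch of these definitions (the PMF below is illustrative, e.g., a loaded die):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])                     # values of the discrete RV
p = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])         # illustrative PMF (sums to 1)

mean = np.sum(x * p)                                  # E[X] = sum_i x_i p_i
e_g = np.sum((x ** 2) * p)                            # E[g(X)] with g(x) = x^2

print(f"E[X] = {mean:.2f}, E[X^2] = {e_g:.2f}, Var(X) = {e_g - mean**2:.2f}")
```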

Mean value and variance of a random variable

MEAN VALUE
Discrete RV: $\mu_X = E[X] = \sum_{i=1}^{N} x_i\,p_i \qquad$ Continuous RV: $\mu_X = E[X] = \int_{-\infty}^{+\infty} x\,p_X(x)\,dx$

VARIANCE
Discrete RV: $\sigma_X^2 = E[(X - \mu_X)^2] = \sum_{i=1}^{N} (x_i - \mu_X)^2\,p_i$
Continuous RV: $\sigma_X^2 = E[(X - \mu_X)^2] = \int_{-\infty}^{+\infty} (x - \mu_X)^2\,p_X(x)\,dx$

Binomial distribution
The observations described using a binomial RV are obtained as
outcomes of a chance experiment with the following conditions:
There are a fixed number of trials, N
Each trial results in one of only two possible outcomes, labeled
success (S) or failure (F)
Outcomes of different trials are independent
The probability that a trial results in a success is the same for
each trial

Binomial distribution
Assumptions and terminology
Each repetition is called a trial.
The number of trials is usually denoted N.
The probability of success is usually denoted $\pi$.
N is fixed in advance.
$\pi$ is the same for every trial.
The outcome of any trial does not influence the outcome of any other trial.

$X \sim \mathrm{Bin}(N, \pi): \quad \Pr(X = k \mid \pi) = \binom{N}{k}\,\pi^k (1 - \pi)^{N-k}$

Binomial coefficient

$\binom{N}{k} = \frac{N!}{k!\,(N-k)!}$

It is read "N choose k" (the number of subsets of size k formed from a group of N distinct items).

Binomial distribution
$\pi = \Pr(A)$ (on a single trial)
One typical problem is the computation of the probability that A occurs exactly k times out of N (independent) trials.

B = A occurs exactly k times out of N trials in a given order

$\Pr(B) = \pi^k (1 - \pi)^{N-k}$

How many (disjoint) events like B exist? $\binom{N}{k}$

C = A occurs exactly k times out of N trials

$\Pr(C) = \binom{N}{k}\,\pi^k (1 - \pi)^{N-k}$

Binomial distribution

A fair die is rolled 5 times ($\pi$ = 1/6; N = 5)

$\Pr(\text{"3" shows exactly twice}) = \binom{5}{2}\left(\tfrac{1}{6}\right)^2\left(\tfrac{5}{6}\right)^3 = 0.161$

$\Pr(\text{"4" shows at least twice}) = 1 - \Pr(\text{"4" does not show}) - \Pr(\text{"4" shows once}) = 1 - \binom{5}{0}\left(\tfrac{1}{6}\right)^0\left(\tfrac{5}{6}\right)^5 - \binom{5}{1}\left(\tfrac{1}{6}\right)\left(\tfrac{5}{6}\right)^4 = 0.196$
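Both probabilities can be checked with a few lines of Python (standard library only; binom_pmf is a hypothetical helper):

```python
from math import comb

def binom_pmf(k, n, p):
    """Pr(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 5, 1 / 6
print(round(binom_pmf(2, n, p), 3))                           # "3" exactly twice -> 0.161
print(round(1 - binom_pmf(0, n, p) - binom_pmf(1, n, p), 3))  # "4" at least twice -> 0.196
```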




Mean value of the binomial distribution

$X \sim \mathrm{Bin}(n, \pi)$

$\mu_X = E[X] = \sum_{i=0}^{n} i\binom{n}{i}\pi^i(1-\pi)^{n-i} = \sum_{i=1}^{n} \frac{n!}{(i-1)!(n-i)!}\pi^i(1-\pi)^{n-i} = n\pi \sum_{j=0}^{n-1} \frac{(n-1)!}{j!(n-1-j)!}\pi^j(1-\pi)^{n-1-j} = n\pi$

(the last sum is the normalization of Bin(n-1, $\pi$))

Variance of the binomial distribution

$X \sim \mathrm{Bin}(n, \pi)$

$E[X^2] = \sum_{i=0}^{n} i^2\binom{n}{i}\pi^i(1-\pi)^{n-i} = \sum_{i=1}^{n} i\,\frac{n!}{(i-1)!(n-i)!}\pi^i(1-\pi)^{n-i} = n\pi\sum_{j=0}^{n-1} (j+1)\,\frac{(n-1)!}{j!(n-1-j)!}\pi^j(1-\pi)^{n-1-j} = n\pi\,[(n-1)\pi + 1]$

(using the normalization and the mean value of Bin(n-1, $\pi$))

$\sigma_X^2 = E[X^2] - (E[X])^2 = n\pi(1-\pi)$

Limiting behavior of the binomial distribution

De Moivre-Laplace theorem: for $np(1-p) \gg 1$,

$\binom{n}{k}p^k(1-p)^{n-k} \approx \frac{1}{\sqrt{2\pi\,np(1-p)}}\exp\!\left(-\frac{(k - np)^2}{2np(1-p)}\right)$

[Figure: binomial probabilities vs. number of successes for success probability 0.1, with N = 30 and N = 300; as N grows the PMF approaches a normal shape.]

Normal distribution

$X \sim N(\mu, \sigma^2): \quad p_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

Standard normal distribution

$X \sim N(\mu, \sigma^2) \;\Rightarrow\; Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$

Error function

$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x \exp(-t^2)\,dt$

Cumulative distribution function

$\Phi(x) = \frac{1}{2} + \frac{1}{2}\,\mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right)$
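A sketch of the standard normal CDF via the error function, cross-checked against SciPy (SciPy assumed to be installed):

```python
from math import erf, sqrt

def std_normal_cdf(x):
    """Phi(x) = 1/2 + 1/2 * erf(x / sqrt(2))."""
    return 0.5 + 0.5 * erf(x / sqrt(2))

print(std_normal_cdf(1.0))    # ~0.8413
print(std_normal_cdf(2.0))    # ~0.9772; Phi(2) - Phi(-2) ~ 0.954 (cf. the empirical rule)

# Cross-check against SciPy, if installed:
from scipy.stats import norm
print(norm.cdf(1.0), norm.cdf(2.0))
```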

Normal approximation to the binomial

For $npq \gg 1$ (with $q = 1 - p$):

$\Pr(k_1 \le k \le k_2) = \sum_{k=k_1}^{k_2}\binom{n}{k}p^k q^{n-k} \approx \Phi\!\left(\frac{k_2 - np}{\sqrt{npq}}\right) - \Phi\!\left(\frac{k_1 - np}{\sqrt{npq}}\right)$

Suppose $k_1 = n(p - \varepsilon)$ and $k_2 = n(p + \varepsilon)$; then

$\Pr\!\left(\left|\frac{k}{n} - p\right| \le \varepsilon\right) \approx 2\,\Phi\!\left(\varepsilon\sqrt{\frac{n}{pq}}\right) - 1 \qquad \lim_{n\to\infty}\Pr\!\left(\left|\frac{k}{n} - p\right| \le \varepsilon\right) = 1$

(k/n is the proportion of successes in n trials, i.e., the relative frequency; p is the probability of success.)

This agrees with the weak law of large numbers: as the number of repetitions of a chance experiment increases, the chance that the relative frequency of occurrence for an event will differ from the true probability of the event by more than any small number approaches 0.

Uniform distribution

$X \sim U(a, b): \quad p_X(x) = \begin{cases} \dfrac{1}{b-a}, & a \le x \le b \\ 0, & \text{elsewhere} \end{cases}$

Standard uniform distribution

$X \sim U(a, b) \;\Rightarrow\; Z = \frac{X - a}{b - a} \sim U(0, 1)$

Mean value and variance of $Z \sim U(0, 1)$:

$\mu_Z = \frac{1}{2} \qquad \sigma_Z^2 = \frac{1}{12}$

Checking for normality

Normal score
Computed for a dataset of n observations:

$NS_i = \mathrm{InvCDF}\!\left(\frac{i - 3/8}{n + 1/4}\right), \quad i = 1, \ldots, n$

Normal probability plot
Sort the n observations from smallest to largest.
Plot the normal scores vs. the sorted observations.
In case of normality, the plot would look like a straight line.

[Figures: normal probability plots showing characteristic departures from a straight line when the population distribution is skewed, has heavier tails than a normal, or contains an outlier.]
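A sketch of the normal-score computation (SciPy's inverse normal CDF, norm.ppf, plays the role of InvCDF; the data are simulated):

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
data = np.sort(rng.normal(size=50))                   # simulated, so the plot should be straight

n = data.size
i = np.arange(1, n + 1)
normal_scores = norm.ppf((i - 3 / 8) / (n + 1 / 4))   # NS_i = InvCDF((i - 3/8)/(n + 1/4))

plt.scatter(data, normal_scores)
plt.xlabel("sorted observations")
plt.ylabel("normal score")
plt.show()
```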


Normalizing transformations
What to do when data samples have a distinctly non-normal shape?
Use a suitable nonlinear transformation applied to the data samples.

Several possibilities exist, depending on the kind of nonlinearity applied (some specified mathematical function), e.g.:
logarithm
square root
reciprocal

Sometimes, the statistical analyses that must be done on datasets provide good performance under the assumption that the population (and hence the sample) is normally distributed.

Example
AGT levels in blood and urine





Hypothesis
The levels of a substance called AGT in blood and urine can be useful to
measure the kidney function

Study
Measurements were taken in 40 adults with chronic kidney disease
Data analysis
Sample distribution of plasma and urine AGT levels

A logarithmic
transformation is
applied to urinary
AGT data, which are
positively skewed (a
long upper tail).

The logarithm affects values in the upper tail much more than in the rest of the distribution, leading to a more symmetric and nearly normal distribution.

Statistics and sampling variability


Inferential statistics uses the information contained in a sample to
reach conclusions about one or more characteristics of the
population from which the sample was selected

Let's start with an example

2012 Cherry Blossom 10-mile run dataset

16,294 runners finished the run (population).
For each runner, sex, age, and time to complete the run were available.
A random sample of 100 runners (sample) will be studied.

Two interesting questions to answer:

What is the average time to complete the 10 miles?
What is the average age of the runners?

Statistics and sampling variability


DEFINITION
Statistic
Any function of the observations in a sample that does not contain unknown parameters. It is an observable RV.

Given a random sample with size n, $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_n)$, define the RVs:

sample mean: $\bar{Y} = f(\mathbf{Y}) = \frac{1}{n}\sum_{i=1}^{n} Y_i$

sample variance: $S^2 = g(\mathbf{Y}) = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2$

sample standard deviation: $S = \sqrt{S^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$

Sample mean and standard deviation are helpful to measure the central tendency and the dispersion of the sample, respectively.
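A compact sketch of these statistics on a simulated sample (NumPy assumed; ddof=1 gives the n-1 divisor):

```python
import numpy as np

rng = np.random.default_rng(7)
sample = rng.normal(loc=94.5, scale=15.0, size=100)   # simulated sample of size n = 100

y_bar = sample.mean()                                  # sample mean
s2 = sample.var(ddof=1)                                # sample variance (divides by n - 1)
s = sample.std(ddof=1)                                 # sample standard deviation

print(f"mean = {y_bar:.2f}, variance = {s2:.2f}, std = {s:.2f}")
```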

Effects of statistical fluctuations for small sample sizes

Running mean
Sequence of means, where each mean uses one more observation in its calculation than the mean directly before it in the sequence.

[Figure: running mean as the sample size grows; with the complete dataset the entire population can be observed without error.]

Sample point estimates only approximate the population parameter, and they vary from one sample to another. It will be useful to quantify how variable an estimate is from one sample to another.

DEFINITION

Sampling distribution
Distribution of the point estimates based on samples of a fixed size from a certain population.
It is useful to think of a particular point estimate as being drawn from such a distribution.

DEFINITION
Standard error of an estimate
The standard deviation associated with an estimate is called the standard error.
It describes the typical error or uncertainty associated with the estimate.
Point estimators: some formal properties

Correctness

$\mathrm{bias}_F(T_n) = E_F[T_n(\mathbf{Y})] - \theta(F)$

The expectation of the point estimates under the distribution F would be equal, for either finite n or asymptotically, to the true value of the parameter of interest associated with F:

$\mathrm{bias}_F(T_n) = 0 \qquad \text{or} \qquad \lim_{n\to\infty}\mathrm{bias}_F(T_n) = 0$

Efficiency

$\mathrm{var}_F(T_n) = E_F[T_n(\mathbf{Y})^2] - (E_F[T_n(\mathbf{Y})])^2$

Given two unbiased estimators, the best estimator is the one showing a smaller dispersion of the estimates around the true value of the parameter of interest associated with F, i.e., the one with smaller variance.

Sample mean

$T_n(\mathbf{Y}) = \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$

Correctness

$\mathrm{bias}_F(T_n(\mathbf{Y})) = 0$

Proof

$E_F[\bar{Y}] = E_F\!\left[\frac{1}{n}\sum_{i=1}^{n} Y_i\right] = \frac{1}{n}\sum_{i=1}^{n} E_F[Y_i] = \frac{1}{n}\,n\mu = \mu$

Efficiency

$\mathrm{var}_F(T_n(\mathbf{Y})) = \frac{\sigma^2}{n} \qquad SE = \frac{\sigma}{\sqrt{n}}$

Proof

$\mathrm{var}(\bar{Y}) = E_F[(\bar{Y} - E_F[\bar{Y}])^2] = E_F[(\bar{Y} - \mu)^2] = \frac{1}{n^2}\sum_{i=1}^{n}\sigma^2 = \frac{\sigma^2}{n}$
Sample variance

$T_n(\mathbf{Y}) = S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2$

Correctness

$\mathrm{bias}_F(S^2) = 0$

Proof

$E_F[S^2] = \frac{1}{n-1}E_F\!\left[\sum_{i=1}^{n}(Y_i - \bar{Y})^2\right] = \frac{1}{n-1}E_F[SS] = \sigma^2$

Corrected sum of squares

$SS = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 \qquad E_F[SS] = E_F\!\left[\sum_{i=1}^{n}Y_i^2 - n\bar{Y}^2\right] = \sum_{i=1}^{n}(\sigma^2 + \mu^2) - n\left(\frac{\sigma^2}{n} + \mu^2\right) = (n-1)\,\sigma^2$
Sample variance
DEFINITION
Degrees of freedom
The number of degrees of freedom of a statistic computed using n observations is equal to the number of independent elements.

Example

$SS = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 \qquad \sum_{i=1}^{n}(Y_i - \bar{Y}) = 0$

Only n-1 of the elements appearing in the definition of SS are independent: SS has n-1 degrees of freedom.
Standard error with/without replacement

$\text{sample mean:}\; \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \qquad \text{standard error:}\; \sigma_{\bar{x}} = \sqrt{\frac{1}{n(n-1)}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

$\text{with replacement:}\; \mathrm{std}_F(\bar{x}) = \frac{\sigma}{\sqrt{n}} \qquad \text{without replacement:}\; \mathrm{std}_F(\bar{x}) = \frac{\sigma}{\sqrt{n}}\,\sqrt{1 - \frac{n-1}{N-1}}$

When sampling without replacement, the standard error computed in the same way as when sampling with replacement is overestimated.
The estimation error is small when the population size N is much larger than the sample size n, e.g., N > 10 n.

Sampling distribution of a sample mean

Example: $F \sim N(8.25, 0.75)$

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \qquad \sigma_{\bar{x}} = \sqrt{\frac{1}{n(n-1)}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

N = 500 random samples were considered.
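A simulation in this spirit (N(8.25, 0.75) is read here as mean 8.25 and standard deviation 0.75, and the per-sample size n = 30 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, n_samples = 8.25, 0.75, 30, 500          # population and sampling setup

sample_means = np.array([rng.normal(mu, sigma, size=n).mean()
                         for _ in range(n_samples)])

print("mean of the sample means :", round(sample_means.mean(), 3))    # close to mu
print("std of the sample means  :", round(sample_means.std(ddof=1), 3))
print("theoretical sigma/sqrt(n):", round(sigma / np.sqrt(n), 3))
```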

Sampling distribution of a sample mean


Example !F F = 13

N = 500 random samples


were considered

Central Limit Theorem


Mathematical statement (quite informal)
Let Y1, Y2, …, Yn be a sequence of n independent, identically distributed (i.i.d.) random variables with finite mean value μ and variance σ²; then, asymptotically:

Z_n = [ Σ_{i=1}^{n} (Y_i − μ) ] / (σ√n)  →  N(0, 1)
Interpretation
Experimental errors can frequently be thought of as arising in an additive manner from several independent sources (this makes the normal distribution a plausible model for the experimental error).

General properties
Population with mean μ and standard deviation σ.
Random sample (size n), with sample mean x̄; the sampling distribution of the sample mean has mean μ_X̄ and standard deviation σ_X̄.
The following rules hold:

Rule 1:  μ_X̄ = μ

Rule 2:  σ_X̄ = σ/√n

Rule 3: When the population distribution is normal, the sampling
distribution of the sample mean is also normal for any sample size n.

Rule 4: (Central Limit Theorem) When n is sufficiently large (rule of thumb: n > 30; take care with highly skewed sample distributions), the sampling distribution of the sample mean is well approximated by a normal distribution.

Comment
Some further statistical tool is needed for small n.
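A small MATLAB sketch illustrating Rule 4; the skewed exponential population is an assumption chosen only for illustration:

% Sketch: Central Limit Theorem demo with a skewed population
% (exponential with mean 1, so sigma = 1; an assumed example).
mu = 1; sigma = 1;
n = 50;                               % sample size (> 30 rule of thumb)
nSamples = 2000;

z = zeros(nSamples, 1);
for k = 1:nSamples
    y = exprnd(mu, n, 1);                       % one i.i.d. sample from the skewed population
    z(k) = (sum(y) - n*mu)/(sigma*sqrt(n));     % standardized sum, as in Z_n
end
histogram(z, 'Normalization', 'pdf')            % close to the standard normal pdf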

Sampling distribution of a sample proportion


The objective of many statistical investigations is to draw a conclusion about the proportion of individuals or objects in a population that possess a specified property.

Any individual or object that possesses the property of interest is labeled a


success (S), otherwise it is labeled a failure (F).

The proportion of successes in the population is named p, whose value is


usually unknown to an investigator.

The statistic that provides a basis for making inferences about p is the
sample proportion of successes in a random sample of size n:

p̂ = (number of S's in the sample) / n

Sampling distribution of a sample proportion


Example

College student population


18,516 students, of whom 8091 (43.7%) are female. The proportion of S's in the population is p = 0.437.

N = 500 random samples


were considered

General properties
Population whose proportion of successes is p.
Random sample (size n), with sample proportion p̂; the sampling distribution of p̂ has mean μ_p̂ and standard deviation σ_p̂.

The following rules hold:

Rule 1:  μ_p̂ = p

Rule 2:  σ_p̂ = √[ p(1 − p)/n ]

Rule 3: (De Moivre–Laplace theorem) When n is large and p is not too near 0 or 1, the sampling distribution is approximately normal. Conservative rule of thumb: np > 10 and n(1 − p) > 10.


Confidence interval (CI)


DEFINITION

Confidence interval
Interval of plausible values for the characteristic of a population. It is constructed so that, with a chosen degree of confidence, the actual value of the population characteristic will be between the lower and upper endpoints of the interval.

DEFINITION

Confidence level
The confidence level associated with a confidence interval estimate is the success rate of the method used to construct the interval.

Confidence interval
Example  The standard error is the standard deviation associated with the estimate. Roughly 95% of the time, the estimate will be within 2 standard errors of the true value of the parameter (true for normal sampling distributions).

(In the figure, one of the plotted intervals does not cover the true mean.)

Common mistake  An incorrect interpretation of confidence intervals is that they capture the population parameter with a certain probability.

Large-sample CI for a population proportion


When n is large and the sample size is less than 10% of the population size,
the statistic:

p̂ = (number in the sample that possess the property of interest) / n

has a sampling distribution that is approximately normal with

mean:  μ_p̂ = p

standard deviation:  σ_p̂ = √[ p(1 − p)/n ]

Confidence interval (95% level)

p̂ − 1.96·√[ p̂(1 − p̂)/n ]  ≤  p  ≤  p̂ + 1.96·√[ p̂(1 − p̂)/n ]        or        p̂ ± 1.96·√[ p̂(1 − p̂)/n ]

This interval will capture p for 95% of all possible samples. Don't say that the probability that p is in the interval is 0.95!
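A minimal MATLAB sketch of this computation, using assumed illustrative counts (not data from the slides):

% Sketch: large-sample 95% CI for a population proportion.
x = 219; n = 500;                 % successes and sample size (assumed values)
phat = x/n;                       % sample proportion
se   = sqrt(phat*(1 - phat)/n);   % estimated standard error of phat
z    = 1.96;                      % 95% critical value from the standard normal
ci   = [phat - z*se, phat + z*se] % approximate 95% confidence interval for p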

Large-sample CI for a population mean


When n is large and the sample size is less than 10% of the population size,
the sample mean has a sampling distribution that is approximately normal
with

mean:  μ_X̄ = μ

standard deviation:  σ_X̄ = σ/√n

Confidence interval (95% level)

x̄ − 1.96·(σ/√n)  ≤  μ  ≤  x̄ + 1.96·(σ/√n)        or        x̄ ± 1.96·(σ/√n)

(in practice, σ is replaced by the sample standard deviation s)

This interval will capture μ for 95% of all possible samples. Don't say that the probability that μ is in the interval is 0.95!


Chi-square sampling distribution


Chi-square RV (with n degrees of freedom)

X ~ χ²_n :   p(x) = [1/(2^{n/2} Γ(n/2))] · x^{n/2 − 1} · exp(−x/2),  x > 0

Let X1, X2, …, Xn be n i.i.d. standard normal random variables; then:

Z = Σ_{i=1}^{n} X_i²  ~  χ²_n

Sample variance probability distribution
Suppose the random sample is from a population normally distributed with variance σ²:

S² = [σ²/(n−1)] · (SS/σ²),        SS/σ² ~ χ²_{n−1}

E[S²] = [σ²/(n−1)]·(n−1) = σ²

var[S²] = [σ²/(n−1)]²·2(n−1) = 2σ⁴/(n−1)

t-Student sampling distribution


t-Student RV (with k degrees of freedom)

X ~ T_k :   p(x) = Γ((k+1)/2) / [√(kπ) Γ(k/2)] · [1 + x²/k]^{−(k+1)/2}

Let Z be a standard normal RV and χ²_k a chi-square RV with k degrees of freedom; then:

Y = Z / √(χ²_k / k)  ~  T_k

Small-sample confidence interval construction
Let Y1, Y2, …, Yn be a sequence of n i.i.d. random variables; then:

Y_i ~ N(μ, σ²)   ⇒   Z = (Ȳ − μ)/(S/√n)  ~  T_{n−1}

In general, E[T_k] = 0 and var(T_k) = k/(k − 2) for k > 2.

Important properties of t-distributions


The t distribution corresponding to any particular number of degrees of
freedom is bell-shaped, symmetric and centered at zero.

Each t-distribution is more spread out than the normal distribution.


As the number of degrees of freedom increases, the spread of the


corresponding t-distribution decreases.

As the number of degrees of freedom increases, the corresponding


sequence of t-distributions approaches the normal distribution

T_n  →  N(0, 1)   as   n → ∞


The t-Student distribution
tends to present heavier tails
than the normal distribution

Small-sample CI for a population mean


When n is not large, the sample size is less than 10% of the population size,
and the population is normal, the statistic

T = (X̄ − μ) / (S/√n)

has a sampling distribution that is t-Student with n − 1 degrees of freedom.

Confidence interval (95% level)

x̄ − t_{n−1,0.025}·(s/√n)  ≤  μ  ≤  x̄ + t_{n−1,0.025}·(s/√n)        or        x̄ ± t_{n−1,0.025}·(s/√n)

This interval will capture μ for 95% of all possible samples. Don't say that the probability that μ is in the interval is 0.95!


e.g., n = 25:   t_{24, 0.025} = 2.06  >  z_{0.025} = 1.96
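A minimal MATLAB sketch of the small-sample interval, with assumed summary statistics:

% Sketch: small-sample 95% CI for a population mean.
xbar = 8.1; s = 0.9; n = 25;        % sample mean, sample SD, sample size (assumed)
alpha = 0.05;
tcrit = tinv(1 - alpha/2, n - 1);   % t_{n-1, 0.025}; tinv(0.975, 24) = 2.06
ci = [xbar - tcrit*s/sqrt(n), xbar + tcrit*s/sqrt(n)]   % 95% CI for mu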


F sampling distribution
F RV (with u and v degrees of freedom)
The ratio of two independent chi-square RVs with u and v degrees of
freedom follows the F-distribution with u, v degrees of freedom:
Y = (χ²_u / u) / (χ²_v / v)  ~  F_{u,v}

Given two independent normal populations with common variance, and two random samples of n1 observations from the first population and n2 observations from the second population, the ratio of the sample variances is F-distributed with n1 − 1, n2 − 1 degrees of freedom:

S1² / S2²  ~  F_{n1−1, n2−1}

Hypothesis testing
Sample data can be used to decide whether some claim or hypothesis about
a population characteristic is plausible

Hypothesis testing is a technique of statistical inference that uses sample


data to decide between two competing claims (hypotheses) about a
population characteristic.

We initially assume that a particular hypothesis, called the null hypothesis


H0, is the correct one. The evidence (the sample data) is then considered
and the null hypothesis is rejected in favor of the competing hypothesis,
called the alternative hypothesis H1, only if there is convincing evidence
against the null hypothesis

The two possible conclusions are then reject the null hypothesis or fail to
reject the null hypothesis (lack of strong support against it)

Hypothesis testing
Inference errors
Two kinds of errors can be committed when testing hypotheses:

α = Pr{type I error} = Pr{reject H0 | H0 is true}

β = Pr{type II error} = Pr{fail to reject H0 | H0 is false}

Power = 1 − β = Pr{reject H0 | H0 is false}
General procedure
Specify a value of α, the probability of type I error, often called the significance level of the test, and then design the test procedure so that the probability of type II error has a suitably small value.

After assessing the consequences of Type I and Type II errors, identify the largest α that is tolerable for the problem. Then employ a test procedure that uses this maximum acceptable value, rather than anything smaller, as the level of significance (because using a smaller α increases β).

Hypothesis tests for a population mean


Theory

Theory

1) Large sample, σ known:        Z = (X̄ − μ)/(σ/√n)  ~  N(0, 1)

2) Small sample, σ unknown, normal population:        T = (X̄ − μ)/(S/√n)  ~  T_{n−1}

H0 : μ = hypothesized value
H1 : μ ≠ hypothesized value

Statistics

1)  z = (x̄ − hypothesized value)/(σ/√n)
    p-value: computed as an area under the z curve

2)  t = (x̄ − hypothesized value)/(s/√n)
    p-value: computed as an area under the t curve with n − 1 degrees of freedom

Power and probability of type II error


Power
It is the probability of correctly rejecting H0 when it is false, depends on the
true value of the population mean (of course, unknown)

Power = 1 − β

It is possible to get some insights into the power of a test by looking at a number of "what if" scenarios.

Effects of various factors on the power of a test

The larger the size of the difference between the hypothesized value and the actual value of the population mean, the higher the power.
The larger the significance level α, the higher the power of the test.
The larger the sample size, the higher the power of the test.



Hypothesis testing
Sample data can be used to decide whether some claim or hypothesis about
a population characteristic is plausible.

Simple comparative experiments


The aim of the experiment is to compare two conditions (aka treatments).

Completely randomized experimental design


The data are viewed as if they were a random sample from a normal population.

Hypothesis testing
Technique of statistical inference useful to compare the two treatments,
with the knowledge of the risks associated with reaching the wrong
conclusion.


Hypothesis testing
Assumptions
Y1 = {y11, y12, …, y1n},   Y1j ~ N(μ1, σ1²)
Y2 = {y21, y22, …, y2n},   Y2j ~ N(μ2, σ2²)

Statistical model

y_ij = μ_i + ε_ij,   i = 1, 2;  j = 1, …, n

ε_ij ~ N(0, σ_i²)

Hypothesis testing
Statistical hypothesis
We are interested in comparing the means of the two formulations

H0 : μ1 = μ2        Null hypothesis

H1 : μ1 ≠ μ2        Two-sided alternative hypothesis

H1 : μ1 > μ2   or   H1 : μ1 < μ2        One-sided alternative hypotheses

Approach
Based on the available observations, compute the value of a test statistic,
the sampling distribution of which is assumed known under H0. Specify the
set of values of the test statistic that leads to rejection of H0 (critical region).


Two-sample t-test
Assumptions
The variances are unknown but assumed identical for both formulations:

Y1 = {y11, y12, …, y1n},   Y1j ~ N(μ1, σ²)
Y2 = {y21, y22, …, y2n},   Y2j ~ N(μ2, σ²)

Test statistic

Ȳ1 − Ȳ2  ~  N( μ1 − μ2 , σ²(1/n1 + 1/n2) )

T0 = (Ȳ1 − Ȳ2) / [ Sp·√(1/n1 + 1/n2) ]  ~  T_{n1+n2−2}   when H0 is true

Sp² = [ (n1 − 1)S1² + (n2 − 1)S2² ] / (n1 + n2 − 2)

Two-sample t-test
Assumptions
The variances are unknown and different for the two formulations:

Y1 = {y11, y12, …, y1n},   Y1j ~ N(μ1, σ1²)
Y2 = {y21, y22, …, y2n},   Y2j ~ N(μ2, σ2²)

Test statistic

Ȳ1 − Ȳ2  ~  N( μ1 − μ2 , σ1²/n1 + σ2²/n2 )

T0 = (Ȳ1 − Ȳ2) / √( S1²/n1 + S2²/n2 )  ~  T_ν (approximately)   when H0 is true

ν ≈ ( S1²/n1 + S2²/n2 )² / [ (S1²/n1)²/(n1 − 1) + (S2²/n2)²/(n2 − 1) ]

Example
Bio-equivalence
When one drug is being tested to replace another, it is important to check
that the new drug has the same effects on the body as the old drug.
Suppose Drug A is being used to lower blood pressure. We want to test
Drug B, a cheaper generic drug, to verify whether it has the same effect on
blood pressure.

Random samples available for doing statistical inference



blood pressure, mmHg      drug A      drug B (generic)
mean                      130.0       123.5
standard deviation        12.6        13.5
sample size               144         16

Example
Two-sample t-test

y1 = 130
y2 = 123.5
Y1 Y2
T0 =
Tn +n 2

1
2
s1 = 12.6
s2 = 13.5
1 1
H0 is!true
Sp
+

n1 = 144
n2 = 16
n1 n2
!
!
!


2
2
(n
1)S
+(n
1)S
y1 y 2
2
1
1
2
2
Sp =
s p = 12.69 t 0 =
= 1.94

n1 + n2 2
1 1
s
+

p
n1 n2
!

To determine whether to reject H0, we would compare t0 to the t-Student
distribution with n1 + n2 2 degrees of freedom.

t /2,n +n 2 t 0.025, 158 = 1.97
1
2
!

Since t 0 < t /2,n
+n
2
H0 is not rejected at the 5% signiOicance level.
1
2
!
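A short MATLAB sketch reproducing this calculation from the summary statistics (it mirrors the slide's numbers; it is not toolbox output):

% Sketch: pooled two-sample t statistic from summary statistics (drug example).
y1 = 130.0;  y2 = 123.5;          % sample means
s1 = 12.6;   s2 = 13.5;           % sample standard deviations
n1 = 144;    n2 = 16;             % sample sizes

sp = sqrt(((n1-1)*s1^2 + (n2-1)*s2^2)/(n1 + n2 - 2));   % pooled SD, about 12.69
t0 = (y1 - y2)/(sp*sqrt(1/n1 + 1/n2));                  % test statistic, about 1.94
dof = n1 + n2 - 2;
tcrit = tinv(0.975, dof);                               % two-sided 5% critical value
pval  = 2*(1 - tcdf(abs(t0), dof));                     % two-sided p-value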

Example
Two-sample t-test

y̅1 = 130,  y̅2 = 123.5
s1 = 12.6,  s2 = 13.5
n1 = 144,  n2 = 64

T0 = (Ȳ1 − Ȳ2) / [ Sp·√(1/n1 + 1/n2) ]  ~  T_{n1+n2−2}   when H0 is true

Sp² = [ (n1 − 1)S1² + (n2 − 1)S2² ] / (n1 + n2 − 2)   →   sp = 12.88

t0 = (y̅1 − y̅2) / [ sp·√(1/n1 + 1/n2) ] = 3.36

To determine whether to reject H0, we compare t0 to the t-Student distribution with n1 + n2 − 2 degrees of freedom:

t_{α/2, n1+n2−2} = t_{0.025, 206} ≈ 1.97

Since |t0| > t_{α/2, n1+n2−2}, H0 is rejected at the 5% significance level.

Choice of sample size


Operating characteristic curves
Suppose we are testing the hypotheses:
H0 : μ1 = μ2
H1 : μ1 ≠ μ2

for the case that the two population variances are unknown but equal, and the means are not equal, so that δ = μ1 − μ2.

critical difference:   d = |μ1 − μ2| / (2σ) = |δ| / (2σ),   with σ the (assumed known) standard deviation

Example: independent samples t-test


Report
There were no outliers in the data, as assessed by inspection of a boxplot.
Scores for each level of gender were normally distributed, as assessed by the Shapiro-Wilk test (p > 0.05).
There was homogeneity of variances, as assessed by Levene's test for equality of variances (p = 0.174).
There was a statistically significant difference in mean engagement score between males and females, with males scoring higher than females, 0.26 (95% CI, 0.04 to 0.48), t(38) = 2.365, p = 0.023.
Cohen's d is 0.75 (a measure of effect size).


Example: independent samples t-test


Compute Cohen's d

s_pooled = √[ ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) ]

d = (m1 − m2) / s_pooled        →        d = 0.748

Effect size    Strength
0.2            small
0.5            medium
0.8            large
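A minimal MATLAB sketch of this computation (the summary statistics below are assumed illustrative values, not the study's data):

% Sketch: Cohen's d for two independent groups from summary statistics.
m1 = 5.56; m2 = 5.30;          % group means (assumed)
s1 = 0.35; s2 = 0.35;          % group standard deviations (assumed)
n1 = 20;   n2 = 20;            % group sizes (assumed)

spooled = sqrt(((n1-1)*s1^2 + (n2-1)*s2^2)/(n1 + n2 - 2));  % pooled SD
d = (m1 - m2)/spooled          % Cohen's d; ~0.2 small, 0.5 medium, 0.8 large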

Example: independent samples t-test





Reducing sample size









Example: independent samples t-test


Compute Cohen's d

s_pooled = √[ ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) ]

d = (m1 − m2) / s_pooled

With the full sample, d = 0.748; with the reduced sample, d = 0.574.

Example: paired samples t-test


Report
There were two outliers in the difference data, as assessed by inspection
of a boxplot. Inspection of their values did not reveal them to be extreme
and they were kept in the analysis.
The assumption of normality was not violated, as assessed by the Shapiro-Wilk test (p = 0.780).
The carbohydrate-protein drink elicited an increase of 0.136 km (95% CI, 0.091 to 0.180) in the distance run in two hours compared to a carbohydrate-only drink, t(19) = 6.352, p < 0.005.
Cohen's d is 1.42.



Example: paired samples t-test


Compute Cohen's d

d = m / s        (m: mean of the paired differences; s: standard deviation of the differences)

Effect size    Strength
0.2            small
0.5            medium
0.8            large

d = 1.42

Analysis of categorical data


Information is often collected on categorical variables. As with numerical
data, categorical datasets can be univariate, bivariate and multivariate

Univariate categorical data are most conveniently summarized in a one-


way frequency table, which consists of k cells for each categorical variable
with k possible values

Testing hypotheses about the proportion of the population that falls into
each of the possible categories

H0 : p1 = hypothesized proportion for Category 1
     p2 = hypothesized proportion for Category 2
     …
     pk = hypothesized proportion for Category k

H1 : at least one of the true category proportions differs from the corresponding hypothesized value

Goodness-of-fit
This statistic is a quantitative measure of the extent to which the observed counts differ from those expected when H0 is true. For a sample of size n:

X² = Σ_{i=1}^{k} (observed cell count − expected cell count)² / expected cell count

expected cell count = n × hypothesized value of the corresponding population proportion

The p-value is computed as the probability of observing a value of X² at least as large as the observed value when H0 is true.

When H0 is correct and the sample size is sufficiently large (expected counts greater than 5 for each class), we have  X² ~ χ²_{k−1} (approximately).


Example
Top Oive US states for sales of hybrid cars in 2004

State         Observed counts    Population proportion
California    250                0.495
Virginia      56                 0.103
Washington    34                 0.085
Florida       33                 0.240
Maryland      33                 0.077
Total         406


We want to test the hypothesis that hybrid sales for these Oive states are
proportional to the population for these states


Example
Top Oive US states for sales of hybrid cars in 2004

State         Observed counts    Expected counts
California    250                406(0.495) = 200.97
Virginia      56                 406(0.103) = 41.818
Washington    34                 406(0.085) = 34.51
Florida       33                 406(0.240) = 97.44
Maryland      33                 406(0.077) = 31.262
Total         406

We want to test the hypothesis that hybrid sales for these Oive states are
proportional to the population for these states


Example
Top Oive US states for sales of hybrid cars in 2004

State         Observed counts    Expected counts         Contribution to X²
California    250                406(0.495) = 200.97     11.9617
Virginia      56                 406(0.103) = 41.818     4.8096
Washington    34                 406(0.085) = 34.51      0.0075
Florida       33                 406(0.240) = 97.44      42.6161
Maryland      33                 406(0.077) = 31.262     0.0966
Total         406                                        X² = 59.4916



Example
Top Oive US states for sales of hybrid cars in 2004

State         Observed counts    Expected counts         Contribution to X²
California    250                406(0.495) = 200.97     11.9617
Virginia      56                 406(0.103) = 41.818     4.8096
Washington    34                 406(0.085) = 34.51      0.0075
Florida       33                 406(0.240) = 97.44      42.6161
Maryland      33                 406(0.077) = 31.262     0.0966
Total         406                                        X² = 59.4916

Example contribution:   (250 − 200.97)² / 200.97 = 11.9617

(Figure: chi-square distribution with 4 degrees of freedom; the shaded area to the right of X² = 59.4916 is the p-value ≈ 3.7e-12.)

Example
Top Oive US states for sales of hybrid cars in 2004
State         Contribution to X²    Observed counts    Expected counts
California    11.9617               250                200.970
Virginia      4.8096                56                 41.818
Washington    0.0075                34                 34.510
Florida       42.6161               33                 97.440
Maryland      0.0966                33                 31.262
Total         59.4916               406                406.000

The highest contributions to X2 are from California, whose sales are higher
than expected, and from Florida, whose sales are lower than expected
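A short MATLAB sketch of this goodness-of-fit computation, using the counts from the table:

% Sketch: chi-square goodness-of-fit test for the hybrid-car sales example.
observed = [250 56 34 33 33];                 % observed counts per state
props    = [0.495 0.103 0.085 0.240 0.077];   % hypothesized population proportions
n        = sum(observed);                     % 406
expected = n*props;                           % expected counts under H0

X2   = sum((observed - expected).^2 ./ expected);   % test statistic, about 59.49
dof  = numel(observed) - 1;                         % k - 1 = 4
pval = 1 - chi2cdf(X2, dof);                        % p-value, about 3.7e-12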



Analysis of categorical data


Information is often collected on categorical variables. As with numerical
data, categorical datasets can be univariate, bivariate and multivariate

Bivariate categorical data are most conveniently summarized in a two-way frequency table, or contingency table, which consists of m rows and n columns for categorical variables with m and n possible values, respectively.

Marginal totals are obtained by adding the observed cell counts in each row
and in each column

Example: random sample of pet owners



                  Dog    Cat    Marginal total
Male              42     10     52
Female            9      39     48
Marginal total    51     49     100


Analysis of categorical data


Assumptions

The data are from independently chosen random samples or from subjects
who were assigned at random to treatment groups

Assumption of large sample size: all expected counts are at least 5. If some
expected counts are less than 5, rows or columns of the table may be
combined to achieve a table with expected counts at least 5

Hypothesis testing
Testing for homogeneity
Are the category proportions the same for all the populations or treatments? (classification according to a single categorical variable)
Testing for independence of two categorical variables
Association between the two categorical variables in a single population is looked for (classification according to two categorical variables)

Test of homogeneity
H0: The true category proportions are the same for all the populations or
treatments (homogeneity of populations or treatments).
H1: The true category proportions are not all the same for all of the
populations or treatments.

Test statistic

X² = Σ_{all cells} (observed cell count − expected cell count)² / expected cell count

expected cell count = (row marginal total)(column marginal total) / grand total

p-values
When H0 is true and the assumptions of the X² test are satisfied,

X² ~ χ² with (number of rows − 1)(number of columns − 1) degrees of freedom

The p-value associated with the computed test statistic value is the area to the right of X² under the chi-square curve with the appropriate dof.

Test for independence


H0: The two variables are independent.
H1: The two variables are not independent.

Test statistic

X² = Σ_{all cells} (observed cell count − expected cell count)² / expected cell count

expected cell count = (row marginal total)(column marginal total) / grand total

p-values
When H0 is true and the assumptions of the X² test are satisfied,

X² ~ χ² with (number of rows − 1)(number of columns − 1) degrees of freedom

The p-value associated with the computed test statistic value is the area to the right of X² under the chi-square curve with the appropriate dof.



Computing the expected cell counts


Test of homogeneity
The expected cell count for a cell corresponding to a particular combination
of values of the two categorical variables can be computed as follows:
expected cell count = (row marginal total)(column marginal total) / grand total
!
These quantities represent what would be expected when there is no
difference between the groups under study.

Test for independence


The expected cell count for a cell corresponding to a particular combination of values of the two categorical variables can be computed as follows:

(proportion of individuals in a particular category combination) = (proportion in specified category of first variable) × (proportion in specified category of second variable)

Computing the expected cell counts




expected cell count = sample size × (observed number in category of first variable / sample size) × (observed number in category of second variable / sample size)

Since the observed number in a category of the first variable is the row marginal total, the observed number in a category of the second variable is the column marginal total, and the sample size is the grand total, this is again:

expected cell count = (row marginal total)(column marginal total) / grand total

Example



                  Dog    Cat    Marginal total
Male              42     10     52
Female            9      39     48
Marginal total    51     49     100

Observed (expected) counts:

                  Dog           Cat
Male              42 (26.52)    10 (25.48)
Female            9 (24.48)     39 (23.52)

X² = (42 − 26.52)²/26.52 + (10 − 25.48)²/25.48 + (9 − 24.48)²/24.48 + (39 − 23.52)²/23.52
   = 9.0358 + 9.4046 + 9.7888 + 10.1884 = 38.4176

p-value = 1 − cdf(chi2, 38.4176, 1) = 5.7115e-10
Example



                  Dog    Cat    Marginal total
Male              27     25     52
Female            24     24     48
Marginal total    51     49     100

Observed (expected) counts:

                  Dog           Cat
Male              27 (26.52)    25 (25.48)
Female            24 (24.48)    24 (23.52)

X² = (27 − 26.52)²/26.52 + (25 − 25.48)²/25.48 + (24 − 24.48)²/24.48 + (24 − 23.52)²/23.52
   = 0.0087 + 0.0090 + 0.0094 + 0.0098 = 0.0369

p-value = 1 − cdf(chi2, 0.0369, 1) = 0.8477

Calculation

X² = 20.6        p-value = 1 − cdf(chi2, 20.6, 4) = 3.8e-4

Interpretation
Strong evidence to support the claim that the proportions in the number-of-concussions categories are not the same for the three groups compared.

Simple linear correlation and regression


Simple linear regression model
Existence of a line with intercept α and slope β, called the population regression line:

y = α + β x + e

The inclusion of the random deviation e in the model equation recognizes that points will deviate from the line by a random amount.

Basic assumptions
The random deviation is normal:  e ~ N(0, σ²)
The standard deviation of e does not depend on the value of x
The random deviations e1, e2, …, en associated with different observations are independent of one another.

Simple linear correlation and regression
















(Figure: for each fixed x, the response is distributed as  Y ~ N(α + β x, σ²).)

Simple linear correlation and regression


Some commonly encountered patterns in scatterplots









(Figure: three panels — data consistent with the simple linear probabilistic model; variability in y that changes with x; a non-linear regression model.)

Be careful about the validity of the basic assumptions!

Estimating the population regression line


Simple linear regression model
Existence of a line with intercept α and slope β, called the population regression line:

y = α + β x + e

The inclusion of the random deviation e in the model equation recognizes that points will deviate from the line by a random amount.

Basic assumptions
The random deviation is normal:  e ~ N(0, σ²)
The standard deviation of e does not depend on the value of x
The random deviations e1, e2, …, en associated with different observations are independent of one another.


Estimating the population regression line


b = point estimate of β = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{n} (x_i − x̄)²

a = point estimate of α = ȳ − b x̄

Let x* denote a specified value of the predictor variable x.
Then a + b x* has two different interpretations:
as a point estimate of the mean y-value when x = x*
as a point prediction of an individual y-value to be observed when x = x*

There is never evidence that the estimated linear relationship can be extrapolated very far beyond the observed x-values: be careful!
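A minimal MATLAB sketch of these point estimates (x and y are assumed illustrative data):

% Sketch: least-squares point estimates of the population regression line.
x = [2 4 5 7 8 10]';                 % predictor values (assumed)
y = [3.1 5.2 5.9 8.3 8.8 11.2]';     % responses (assumed)

b = sum((x - mean(x)).*(y - mean(y))) / sum((x - mean(x)).^2);  % slope estimate
a = mean(y) - b*mean(x);                                        % intercept estimate

xstar = 6;                     % a specified x value
yhat  = a + b*xstar;           % estimate of the mean y at x*, or prediction of a single y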








Estimating the standard deviation of e


Using the fitted (predicted) values ŷ_i:

SSResid = Σ_{i=1}^{n} (y_i − ŷ_i)²

s_e = √[ SSResid / (n − 2) ]
Properties of the sampling distribution of b


When the three basic assumptions of the simple linear regression model are
satisfied, the following statements are true:

the statistic b is normally distributed
b is an unbiased estimator of β:   b ~ N(β, σ_b),   with   σ_b = σ / √S_xx
β tends to be more precisely estimated when the x-values in the sample are spread out rather than when they are close together, and when little variability exists about the population line.

t = (b − β) / s_b  ~  T_{n−2},        s_b = s_e / √S_xx

confidence interval:   b ± (t critical value)·s_b

y: time to run a 20-km ski race
x: treadmill run time to exhaustion, min

Hypothesis
The treadmill run time predicts the time to run a 20-km ski race in elite biathletes.

Study
Measurements were taken in 11 US elite biathletes.

95% confidence interval:   b ± t_{9,0.025}·s_b = −2.333 ± 2.262·0.591  ≈  (−3.7, −1.0)

Interpretation
We are 95% confident that the true average decrease in ski time associated with a 1-minute increase in treadmill time is between 1 and 3.7 minutes.

standardized residual = residual / (estimated residual standard deviation)

estimated residual standard deviation = s_e·√[ 1 − 1/n − (x_i − x̄)²/S_xx ]

Properties of the sampling distribution of a + b x


When the three basic assumptions of the simple linear regression model are
satisfied, the following statements are true:

the statistic a + b x* is normally distributed
a + b x* is an unbiased estimator of α + β x*:   a + b x* ~ N(α + β x*, σ_{a+bx*})

σ_{a+bx*} = σ·√[ 1/n + (x* − x̄)²/S_xx ]

More precise estimates are obtained when the x*-value is close to the center of the x-values at which observations were made.

t = [ a + b x* − (α + β x*) ] / s_{a+bx*}  ~  T_{n−2}

confidence interval:   a + b x* ± (t critical value)·s_{a+bx*}

s_{a+bx*} = s_e·√[ 1/n + (x* − x̄)²/S_xx ]

Prediction interval for a single y


When the three basic assumptions of the simple linear regression model are
satisOied, the following statements are true:
prediction!interval
a + b x * ( + x * )

*
2
2
a + b x (t n2, /2 ) se + s a+b x *
t=
Tn2
sa+b x *
!

1 (x * x )2
sa+b x * = se
+

n
S xx
!

An interval for a single y-value, y*, is called a prediction interval (to
distinguish it from the conOidence interval for a mean y-value) with a
similar interpretation

The prediction interval and the confidence interval are centered at the same place. The addition of s_e² under the square-root symbol makes the prediction interval wider than the confidence interval.

Because it is difficult to measure jaw width in living sharks, researchers would like to determine whether it is possible to estimate jaw width from body length, which is more easily measured.


About the population correlation coefOicient


Two variables are independent when there is no association between them. In general, ρ = 0 is not equivalent to independence, unless the two variables in the population follow a bivariate normal distribution.

Joint probability density function for a multivariate normal distribution:

p(x) = [ 1 / ( (2π)^{n/2} √det(Σ) ) ] · exp[ −½ (x − μ)ᵀ Σ⁻¹ (x − μ) ]

Joint probability density function for a bivariate normal distribution:

x = [x1; x2],   μ = [μ1; μ2],   Σ = [σ11 σ12; σ21 σ22]   (covariance matrix: symmetric, positive definite)

p(x) = [ 1 / ( 2π √det(Σ) ) ] · exp[ −½ (x − μ)ᵀ Σ⁻¹ (x − μ) ]

Spherical process:             μ1 = [730 1090]ᵀ,   Σ1 = [8000 0; 0 8000]

Diagonal covariance matrix:    μ2 = [730 1090]ᵀ,   Σ2 = [8000 0; 0 18500]

Full covariance matrix:        μ3 = [730 1090]ᵀ,   Σ3 = [8000 8400; 8400 18500]

Information sharing between the two axes!

Test of independence
Hypothesis testing

H0 : ρ = 0
H1 : ρ ≠ 0

Statistic

t = r / √[ (1 − r²)/(n − 2) ]  ~  T_{n−2}

Assumption
r is the correlation coefficient for a random sample from a bivariate normal population.

Caveat
It is necessary to verify bivariate normality of the sample (not easy, especially for small sample sizes).
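A minimal MATLAB sketch of this test (x and y are assumed illustrative data):

% Sketch: t-test for H0: rho = 0, based on the sample correlation coefficient.
x = randn(30, 1);                 % assumed illustrative data
y = 0.5*x + randn(30, 1);
n = numel(x);

r = corr(x, y);                         % sample correlation coefficient
t = r / sqrt((1 - r^2)/(n - 2));        % test statistic, ~ T_{n-2} under H0
pval = 2*(1 - tcdf(abs(t), n - 2));     % two-sided p-value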

The Analysis of Variance


Case study
Suppose we have a treatments, i.e., a different levels of a single factor, that we wish to compare.

Typical data for a single-factor experiment


Treatment    Observations              Totals    Averages
1            y11  y12  …  y1n          y1.       ȳ1.
2            y21  y22  …  y2n          y2.       ȳ2.
…            …                         …         …
a            ya1  ya2  …  yan          ya.       ȳa.
                                       y..       ȳ..

The Analysis of Variance (ANOVA)


Models for the data (effects model, or one-way ANOVA model)

y_ij = μ + τ_i + ε_ij,   i = 1, …, a;  j = 1, …, n

μ: overall mean;  τ_i: i-th treatment effect;  ε_ij: random error component

The random error component incorporates sources of variability including measurement, differences between the experimental units to which the treatments are applied, effects of environment variables, variability over time, and so forth.

Assumptions for hypothesis testing

The model errors are assumed to be normally and independently distributed random variables with zero mean and variance σ² (constant for all levels of the factor).

Analysis of the fixed-effects model

Definitions

y_i. = Σ_{j=1}^{n} y_ij,   ȳ_i. = y_i./n,   i = 1, …, a

y.. = Σ_{i=1}^{a} Σ_{j=1}^{n} y_ij,   ȳ.. = y../N,   N = an

Assumptions

μ_i = μ + τ_i,   i = 1, …, a,        with        Σ_{i=1}^{a} τ_i = 0

Analysis of the fixed-effects model

Hypothesis testing

H0 : μ1 = μ2 = … = μa
H1 : μi ≠ μj  for at least one pair (i, j)

Analysis of the fixed-effects model

Decomposition of the Total Sum of Squares
Total corrected sum of squares SS_T
It is a measure of overall variability in the data; the sample variance of the observations can be computed by dividing SS_T by N − 1.

SS_T = Σ_{i=1}^{a} Σ_{j=1}^{n} (y_ij − ȳ..)² = Σ_{i=1}^{a} Σ_{j=1}^{n} [ (ȳ_i. − ȳ..) + (y_ij − ȳ_i.) ]²

SS_T = n Σ_{i=1}^{a} (ȳ_i. − ȳ..)² + Σ_{i=1}^{a} Σ_{j=1}^{n} (y_ij − ȳ_i.)² = SS_Treatments + SS_E

(the cross-product term vanishes because Σ_{j=1}^{n} (y_ij − ȳ_i.) = 0)

SS_Treatments: sum of squares of the differences between the treatment averages and the grand average (a measure of the differences between treatment means)

SS_E: sum of squares of the differences of observations within treatments from the treatment average (due only to random error)

Analysis of the fixed-effects model

SS_T  =  SS_Treatments  +  SS_E
(N − 1 dof)   (a − 1 dof)   (N − a dof)

SS_E = Σ_{i=1}^{a} Σ_{j=1}^{n} (y_ij − ȳ_i.)² = Σ_{i=1}^{a} (n − 1)s_i²

s_p² = [ (n − 1)s1² + (n − 1)s2² + … + (n − 1)s_a² ] / [ (n − 1) + (n − 1) + … + (n − 1) ] = SS_E / (N − a)

Pooled estimate of the common variance within treatments.

SS_Treatments = n Σ_{i=1}^{a} (ȳ_i. − ȳ..)²

SS_Treatments / (a − 1) = n Σ_{i=1}^{a} (ȳ_i. − ȳ..)² / (a − 1)   estimates σ² under H0

Analysis of the fixed-effects model

Decomposition of the Total Sum of Squares

SS_T = SS_Treatments + SS_E

SS_E / (N − a)   estimates σ²   (based on the variability within treatments)

SS_Treatments / (a − 1)   estimates σ² under H0   (based on the variability between treatments)

MS_E = SS_E / (N − a),        MS_Treatments = SS_Treatments / (a − 1)

E_F[MS_E] = σ²

E_F[MS_Treatments] = σ² + n Σ_{i=1}^{a} τ_i² / (a − 1)

Analysis of the fixed-effects model

Statistical analysis

SS_T/σ² ~ χ²_{N−1},    SS_Treatments/σ² ~ χ²_{a−1},    SS_E/σ² ~ χ²_{N−a}

SS_Treatments and SS_E are independent, so

F0 = [ SS_Treatments/(a − 1) ] / [ SS_E/(N − a) ] = MS_Treatments / MS_E  ~  F(a − 1, N − a)

Reject H0 (i.e., there are differences in the treatment means) if

f0 = [ SS_Treatments/(a − 1) ] / [ SS_E/(N − a) ]  >  F_{α, a−1, N−a}

Example (using MATLAB)


The following example is from a study of the strength of structural
beams. The vector strength measures deflections of beams in
thousandths of an inch under 3,000 pounds of force. The vector
alloy identifies each beam as steel ('st'), alloy 1 ('al1'), or
alloy 2 ('al2'). The null hypothesis is that steel beams are equal
in strength to beams made of the two more expensive alloys.

>> strength = [82 86 79 83 84 85 86 87 74 82 78 75 76 77 79 79 77 78 82 79];
>> alloy = {'st','st','st','st','st','st','st','st', ...
            'al1','al1','al1','al1','al1','al1', ...
            'al2','al2','al2','al2','al2','al2'};
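The slide does not show the call that produces the p-value and box plot; a plausible completion, assuming the Statistics Toolbox function anova1 was used, is:

>> [p, tbl, stats] = anova1(strength, alloy);   % one-way ANOVA table and box plot
>> p                                            % p-value for H0: equal group means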

The p-value suggests rejection of the null hypothesis.

The box plot shows that steel beams deflect more than beams made of the more expensive alloys.

Estimation of the model parameters

Means

y_ij = μ + τ_i + ε_ij

μ̂ = ȳ..
τ̂_i = ȳ_i. − ȳ..,   i = 1, …, a

ȳ_i. ~ N(μ_i, σ²/n)

ȳ_i. − t_{α/2, N−a}·√(MS_E/n)  ≤  μ_i  ≤  ȳ_i. + t_{α/2, N−a}·√(MS_E/n)

ȳ_i. − ȳ_j. − t_{α/2, N−a}·√(2MS_E/n)  ≤  μ_i − μ_j  ≤  ȳ_i. − ȳ_j. + t_{α/2, N−a}·√(2MS_E/n)

Estimation of the model parameters

One-at-a-time confidence intervals

ȳ_i. − ȳ_j. − t_{α/2, N−a}·√(2MS_E/n)  ≤  μ_i − μ_j  ≤  ȳ_i. − ȳ_j. + t_{α/2, N−a}·√(2MS_E/n)

Simultaneous confidence intervals

If there are r 100(1 − α)% CIs, the probability that the r intervals will be simultaneously correct is at least 100(1 − rα)%.

Bonferroni correction
To construct a set of simultaneously correct CIs, replace α/2 with α/(2r) in the one-at-a-time confidence intervals. The method works nicely if r is not too large.

Model adequacy checking


Violation of the basic assumptions
Errors are assumed to be normally and independently distributed with zero mean and constant but unknown variance.
Checking can easily be conducted with the analysis of the residuals:

e_ij = y_ij − ŷ_ij,        ŷ_ij = μ̂ + τ̂_i = ȳ_i.

If the model is adequate, the residuals are without structure.

Normality assumption
Construct a normal probability plot of the residuals. There can be problems with small samples. A moderate departure from normality does not necessarily imply a serious violation of the assumptions.
The ANOVA is robust to the normality assumption (the true significance level and power differ slightly from the stipulated values, with the power being lower).

Model adequacy checking


Outlier detection
Examine the standardized residuals

d_ij = e_ij / √MS_E  ~  N(0, 1)

About 68% and 95% of the standardized residuals should fall within the limits ±1 and ±2, respectively.

Plot of residuals in time sequence

Plotting the residuals in the time order of data collection is helpful in detecting correlation between the residuals, checking for violation of the independence assumption, or variance changing over time.

Plot of residuals versus fitted values

Residuals should be uncorrelated with the predicted response and with all other variables that may influence the course of the experiment.

Model adequacy checking


Variance homogeneity
Non-constant variance arises when the experimental error is multiplicative (the error is a percentage of the scale reading). It also arises when the data follow a non-normal distribution with large values of skewness.

Problems arise especially with unbalanced designs, where the combination of smaller sample size and higher variance leads to underestimated coverage of the CIs.

Variance-stabilizing transformations can be applied.

Model adequacy checking


Statistical tests for equality of variance
Non-constant variance arises when the experimental error is multiplicative (the error is a percentage of the scale reading). It also arises when the data follow a non-normal distribution with large values of skewness.

H0 : σ1² = σ2² = … = σa²
H1 : the above is not true for at least one σi²

Bartlett's test is very sensitive to the normality assumption; a useful alternative is the modified Levene test, a sort of ANOVA applied to the absolute deviations of the observations in each treatment from the treatment median.

Comparisons among treatment means


Bartlett's test

χ0² = 2.3026·q/c  ~  χ²_{a−1}  (approximately, under H0)

q = (N − a)·log10(S_p²) − Σ_{i=1}^{a} (n_i − 1)·log10(S_i²)

c = 1 + [1/(3(a − 1))]·[ Σ_{i=1}^{a} 1/(n_i − 1) − 1/(N − a) ]

S_p² = [1/(N − a)]·Σ_{i=1}^{a} (n_i − 1)·S_i²

Reject H0 when  χ0² > χ²_{α, a−1}

Normal probability plot

Used to assess whether data come from a normal distribution.

Plot of the empirical probability versus the data value for each point in the data. A solid line connects the 25th and 75th percentiles of the data, and a dashed line extends it to the ends of the data. The y-axis values are probabilities from zero to one, but the scale is not linear.

In a normal probability plot, if all the data points fall near the line, an assumption of normality is reasonable. Otherwise, the points will curve away from the line, and an assumption of normality is not justified.

Comparisons among treatment means


Suppose that in conducting a fixed-effects ANOVA H0 is rejected. The question is: which means differ? Sometimes, further comparisons and analysis among groups of treatment means may be useful.
Multiple comparison methods

Contrasts

H0 : μ4 = μ5            equivalently        H0 : μ4 − μ5 = 0
H1 : μ4 ≠ μ5                                H1 : μ4 − μ5 ≠ 0

H0 : μ1 + μ2 = μ4 + μ5  equivalently        H0 : μ1 + μ2 − μ4 − μ5 = 0
H1 : μ1 + μ2 ≠ μ4 + μ5                      H1 : μ1 + μ2 − μ4 − μ5 ≠ 0

A contrast is a linear combination of parameters of the form

Γ = Σ_{i=1}^{a} c_i μ_i,        with        Σ_{i=1}^{a} c_i = 0

H0 : Σ_{i=1}^{a} c_i μ_i = 0
H1 : Σ_{i=1}^{a} c_i μ_i ≠ 0

Comparisons among treatment means

Contrasts can be written in terms of treatment totals or averages (in the latter case we may be interested in constructing expressions for the confidence intervals).

C = Σ_{i=1}^{a} c_i ȳ_i.,        var(C) = (σ²/n)·Σ_{i=1}^{a} c_i²

t0 = Σ_{i=1}^{a} c_i ȳ_i. / √[ (MS_E/n)·Σ_{i=1}^{a} c_i² ]  ~  T_{N−a}  under H0,        or equivalently        t0² ~ F_{1, N−a}

confidence interval:   Σ_{i=1}^{a} c_i ȳ_i.  ±  t_{α/2, N−a}·√[ (MS_E/n)·Σ_{i=1}^{a} c_i² ]

Comparisons among treatment means


Orthogonal contrasts
Two contrasts with coefficients c_i and d_i are orthogonal if

Σ_{i=1}^{a} c_i d_i = 0

For a treatments, a set of a − 1 orthogonal contrasts partitions the sum of squares due to treatments into a − 1 independent single-degree-of-freedom components. Thus tests performed on orthogonal contrasts are independent.
Generally, the method of contrasts (or orthogonal contrasts) is useful for what are called preplanned comparisons. That is, the contrasts are specified prior to running the experiment and examining the data.

Scheffé's method
In many situations, experimenters may not know in advance which contrasts they wish to compare. For a preliminary exploration of the data, it may be useful to compare any and all possible contrasts between treatment means.

Γ_u = Σ_{i=1}^{a} c_iu μ_i,   u = 1, …, m

C_u = Σ_{i=1}^{a} c_iu ȳ_i.,   u = 1, …, m,        S_{C_u} = √[ MS_E·Σ_{i=1}^{a} (c_iu²/n_i) ]

It can be shown that the critical value against which C_u should be compared is

S_{α,u} = S_{C_u}·√[ (a − 1)·F_{α, a−1, N−a} ]

The hypothesis that the contrast Γ_u differs significantly from zero is rejected if  |C_u| > S_{α,u}.

Scheffé's method: simultaneous confidence intervals. The probability that they are all simultaneously true is 1 − α:

C_u − S_{α,u}  ≤  Γ_u  ≤  C_u + S_{α,u}

Comparing pairs of means

There are several possibilities to compare all pairs of the a treatment means:

H0 : μi = μj   for all i ≠ j
H1 : μi ≠ μj

Tukey's Test
Tukey's Least Significant Difference (LSD) Method
Tukey's Honestly Significant Difference (HSD) Method
Duncan's Multiple Range Test
Dunnett's Test
Newman-Keuls Test

Comments
Scheffé's method is conservative (do not use it unless you have planned to analyze many contrasts).

Comparing pairs of means

LSD is liberal; the HSD is probably a better choice.

'hsd' or 'tukey-kramer'
Use Tukey's honestly significant difference criterion. This is the default, and it is based on the Studentized range distribution. It is optimal for balanced one-way ANOVA and similar procedures with equal sample sizes. It has been proven to be conservative for one-way ANOVA with different sample sizes. According to the unproven Tukey-Kramer conjecture, it is also accurate for problems where the quantities being compared are correlated, as in analysis of covariance with unbalanced covariate values.

'lsd'
Use Tukey's least significant difference procedure. This procedure is a simple t-test. It is reasonable if the preliminary test (say, the one-way ANOVA F statistic) shows a significant difference. If it is used …
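These options belong to MATLAB's multcompare; a minimal sketch of how they might be used after the one-way ANOVA above (assuming the stats output of anova1 is available):

% Sketch: pairwise comparisons after a one-way ANOVA (reuses the beam example).
[p, tbl, stats] = anova1(strength, alloy, 'off');   % suppress figures
c = multcompare(stats, 'CType', 'hsd');             % Tukey's HSD pairwise comparisons
% Each row of c: [group i, group j, lower CI, estimated difference, upper CI, p-value]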

Randomized Complete Block Design (RBCD)


Preliminary considerations
In any experiment, variability can arise from several factors and affect the results and their interpretation.
Nuisance factor
A design factor that probably has an effect on the response, but we are not interested in that effect.
Sometimes, a nuisance factor is unknown and/or uncontrollable.
Randomization
Technique used to guard against nuisance factors that are unknown and/or uncontrollable.
Blocking
Technique used when the nuisance factor is known and controllable.

Randomized Complete Block Design (RBCD)


Case study
We intend to determine whether or not four different tips produce different readings on a hardness testing machine.

The machine operates by pressing the tip into a metal test coupon.

The hardness of the coupon can be determined from the depth of the resulting depression.

The experimenter has decided to obtain four observations for each tip.

Sixteen experimental units are available for analysis.


Randomized Complete Block Design (RBCD)


Completely randomized experiment
Potential problems arise in this design situation when, e.g., the metal coupons differ slightly in their hardness.

Variability due to the experimental units will contribute to the variability observed in the hardness data.

Randomized complete block design

The aim is to remove the variability due to the test coupons from the experimental error.

The approach is to test each tip once on each of the four coupons.

The blocks form a more homogeneous experimental unit on which to compare the tips.

Randomized Complete Block Design (RBCD)

Each block contains all treatments and, within each block, the order in which the tips are tested is randomly determined.
The only randomization of treatments is within the blocks, which therefore represent a restriction on randomization.
Examples of blocking factors are batches of, e.g.:
raw material
people
time

RBCD statistical analysis

Model (fixed effects)

y_ij = μ + τ_i + β_j + ε_ij,   i = 1, 2, …, a;  j = 1, 2, …, b

τ_i: effect of the i-th treatment,  Σ_{i=1}^{a} τ_i = 0
β_j: effect of the j-th block,  Σ_{j=1}^{b} β_j = 0
ε_ij ~ N(0, σ²)

Hypothesis testing

H0 : μ1 = μ2 = … = μa
H1 : at least one μi ≠ μj

equivalently

H0 : τ1 = τ2 = … = τa = 0
H1 : at least one τi ≠ 0

RBCD statistical analysis

Equation of the total corrected sum of squares

SS_T = SS_Treatments + SS_Blocks + SS_E

SS_T = Σ_{i=1}^{a} Σ_{j=1}^{b} (y_ij − ȳ..)²,   with N − 1 degrees of freedom

SS_Treatments = b Σ_{i=1}^{a} (ȳ_i. − ȳ..)²,   with a − 1 degrees of freedom

SS_Blocks = a Σ_{j=1}^{b} (ȳ_.j − ȳ..)²,   with b − 1 degrees of freedom

SS_E = Σ_{i=1}^{a} Σ_{j=1}^{b} (y_ij − ȳ_i. − ȳ_.j + ȳ..)²,   with (a − 1)(b − 1) degrees of freedom

RBCD statistical analysis

Inference

SS_E/σ²,  SS_Treatments/σ²,  SS_Blocks/σ²   are independent chi-square variables

E[MS_E] = σ²

E[MS_Treatments] = σ² + b Σ_{i=1}^{a} τ_i² / (a − 1)

E[MS_Blocks] = σ² + a Σ_{j=1}^{b} β_j² / (b − 1)

F0 = MS_Treatments / MS_E  ~  F_{a−1, (a−1)(b−1)}

F0 = MS_Blocks / MS_E  ~  F_{b−1, (a−1)(b−1)}

Can the latter be used to test equality of block means?

Be careful: the blocks represent a restriction on randomization, since randomization has been applied only within blocks. This statistic would not be used to test equality of block means. It is only reasonable to look at it as an approximate procedure to investigate the effect of the blocking variable.

The randomized block design reduces the amount of noise in the data sufficiently for differences among the four tips to be detected.

Latin Square Design

Preliminary considerations
Need for using the blocking principle in a situation where two, or more, nuisance factors are to be guarded against.
Example
Study the effects of five different formulations of a rocket propellant used in aircrew escape systems on the observed burning rate. Each batch of material is only sufficient to prepare five formulations. The formulations are prepared by five different operators with different skills and experience.
Approach
Test each formulation once in each batch of material. Each formulation is prepared exactly once by each of the five operators.

Latin Square Design

Nuisance factors
Batches of raw material and operators are arranged in a square.
Treatments
The formulations are indicated by the Latin letters A, B, C, D, E.
Blocking in two directions
Two restrictions on randomization apply in this design.

Latin Square statistical analysis

Model (fixed effects)

y_ijk = μ + α_i + τ_j + β_k + ε_ijk,   i = 1, 2, …, p;  j = 1, 2, …, p;  k = 1, 2, …, p

α_i: effect of the i-th row,  τ_j: effect of the j-th treatment,  β_k: effect of the k-th column

Σ_{i=1}^{p} α_i = 0,   Σ_{j=1}^{p} τ_j = 0,   Σ_{k=1}^{p} β_k = 0,   ε_ijk ~ N(0, σ²)

Comment

The model is linear (additive), without interaction between rows, columns and treatments.
To denote a particular observation, we need to specify only two of the three subscripts, e.g., i and j. This is due to having only one treatment for each combination of the nuisance factors.

Latin Square statistical analysis

Equation of the total corrected sum of squares

SS_T = SS_Rows + SS_Columns + SS_Treatments + SS_E

degrees of freedom:   p² − 1 = (p − 1) + (p − 1) + (p − 1) + (p − 2)(p − 1)

(error dof:  p² − 1 − 3(p − 1) = (p − 2)(p − 1))

Testing for equality of means

F0 = MS_Treatments / MS_E  ~  F_{p−1, (p−2)(p−1)}

Reject H0 when  F0 > F_{α, p−1, (p−2)(p−1)}

Between-subjects ANOVA
Introduction
It is used to test hypotheses about differences between two or more means.

Caveat
The t-test based on the standard error of the difference between two means can only be used to test differences between two means.
When there are more than two means, it is possible to compare each mean with each other mean using t-tests, but this leads to severe inflation of the type I error rate.

Case study

After coding by subtracting 25 from each observation in our example


Factorial designs
Definition
In each complete trial or replication of the experiment, all possible combinations of the levels of the factors are investigated. For example, if there are a levels of factor A and b levels of factor B, each replicate contains all ab treatment combinations.

Factor main effect

The change in response produced by a change in the level of the factor.

Interaction
It occurs when the difference in response between the levels of one factor is not the same at all levels of the other factors.

(Figures: response plotted versus Factor A (low, high) with separate lines for the low and high levels of Factor B, illustrating factor main effects and interaction.)

The two-factor factorial design


Case study
An engineer is designing a battery for use in a device that will be subjected to extreme variations in temperature.

Design parameters
Plate material of the battery (three possible choices).
Lab tests performed at each of three temperatures, consistent with the product end-use environment.
Four batteries tested at each combination of plate material and temperature.

Questions
Effects of material type and temperature on the battery life.
Is there any material that would give uniformly long battery life regardless of temperature? (robust product design)

The two-factor factorial design


y_ijk:   i = 1, …, a  level of Factor A;  j = 1, …, b  level of Factor B;  k = 1, …, n  number of replicates

The order in which the abn observations are taken is random (completely randomized design).

Statistical analysis
Fixed effects model

y_ijk = μ + τ_i + β_j + (τβ)_ij + ε_ijk

Σ_{i=1}^{a} τ_i = 0,   Σ_{j=1}^{b} β_j = 0,   Σ_{i=1}^{a} (τβ)_ij = Σ_{j=1}^{b} (τβ)_ij = 0,   ε_ijk ~ N(0, σ²)

Hypothesis testing

Row treatment:        H0 : τ1 = τ2 = … = τa = 0        H1 : τi ≠ 0 for at least one i

Column treatment:     H0 : β1 = β2 = … = βb = 0        H1 : βj ≠ 0 for at least one j

Interaction:          H0 : (τβ)_ij = 0,  i = 1, …, a;  j = 1, …, b        H1 : (τβ)_ij ≠ 0 for at least one i, j

Statistical analysis
Row, column, cell and grand totals and averages

y_i.. = Σ_{j=1}^{b} Σ_{k=1}^{n} y_ijk,   ȳ_i.. = y_i../(bn),   i = 1, …, a

y_.j. = Σ_{i=1}^{a} Σ_{k=1}^{n} y_ijk,   ȳ_.j. = y_.j./(an),   j = 1, …, b

y_ij. = Σ_{k=1}^{n} y_ijk,   ȳ_ij. = y_ij./n,   i = 1, …, a;  j = 1, …, b

y_... = Σ_{i=1}^{a} Σ_{j=1}^{b} Σ_{k=1}^{n} y_ijk,   ȳ_... = y_.../(abn)

Statistical analysis
Equation of the total corrected sum of squares

SS_T = SS_A + SS_B + SS_AB + SS_E

SS_A = bn Σ_{i=1}^{a} (ȳ_i.. − ȳ_...)²

SS_B = an Σ_{j=1}^{b} (ȳ_.j. − ȳ_...)²

SS_AB = n Σ_{i=1}^{a} Σ_{j=1}^{b} (ȳ_ij. − ȳ_i.. − ȳ_.j. + ȳ_...)²

SS_E = Σ_{i=1}^{a} Σ_{j=1}^{b} Σ_{k=1}^{n} (y_ijk − ȳ_ij.)²

Effect            Degrees of freedom
A                 a − 1
B                 b − 1
AB interaction    (a − 1)(b − 1)
Error             ab(n − 1)
Total             abn − 1
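A minimal MATLAB sketch of a balanced two-factor analysis of this kind (the data values are assumed placeholders, not the battery data; anova2 expects the replicates of each cell stacked in consecutive rows):

% Sketch: balanced two-factor (a x b, n replicates) ANOVA with anova2.
n = 2;                              % replicates per cell (assumed)
y = [12 18 15;                      % row-factor level 1, replicate 1 (assumed values)
     14 17 16;                      % row-factor level 1, replicate 2
     20 11 13;                      % row-factor level 2, replicate 1
     22 10 12];                     % row-factor level 2, replicate 2

[p, tbl] = anova2(y, n);            % p(1): column factor, p(2): row factor, p(3): interaction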

Statistical analysis
Each sum of squares divided by its degrees of freedom is a mean square.

E[MS_A] = σ² + bn Σ_{i=1}^{a} τ_i² / (a − 1)

E[MS_B] = σ² + an Σ_{j=1}^{b} β_j² / (b − 1)

E[MS_AB] = σ² + n Σ_{i=1}^{a} Σ_{j=1}^{b} (τβ)_ij² / [(a − 1)(b − 1)]

E[MS_E] = σ²

MS_A / MS_E  ~  F_{a−1, ab(n−1)}

MS_B / MS_E  ~  F_{b−1, ab(n−1)}

MS_AB / MS_E  ~  F_{(a−1)(b−1), ab(n−1)}

Main effects (material type and temperature) are statistically significant; there is a significant interaction between material type and temperature.

SS_Model = SS_Material + SS_Temperature + SS_Interaction

R² = SS_Model / SS_T

About 77% of the variability in the battery life is explained by the plate material of the battery, the temperature, and the material type × temperature interaction.

Main effects (material type and temperature) are statistically significant; a significant interaction exists between material type and temperature.

Longer life at low temperature, regardless of material type.
Material type 3 seems to win over the range of tested temperatures.
From low to intermediate temperature, the battery with material type 3 lives longer, whilst batteries with material types 1 and 2 live shorter.
From intermediate to high temperature, batteries with material types 2 and 3 live shorter, whilst the battery with material type 1 does not change.

Multiple comparisons using Tukey's test

When the interaction is significant, comparisons between the means of one factor can be obscured by the interaction with the other factor.
Fix one factor at a specific level and compare the means for the other factor, e.g., at the intermediate temperature:

T_0.05 = q_0.05(3, 27)·√(MS_E/n) = 45.47

ȳ_12. = 57.25,   ȳ_22. = 119.75,   ȳ_32. = 145.75

ȳ_22. − ȳ_12. = 62.50 > T_0.05
ȳ_32. − ȳ_22. = 26.00 < T_0.05
ȳ_32. − ȳ_12. = 88.50 > T_0.05

At the intermediate temperature, the mean battery life is the same for material types 2 and 3, and the mean battery life for material type 1 is significantly lower than for both types 2 and 3.

Model adequacy checking

Primary diagnostic tool: residual analysis

e_ijk = y_ijk − ŷ_ijk = y_ijk − ȳ_ij.

Normal probability plot, to check Gaussianity of the residuals.
Plot of the residuals vs. fitted values, to check variance homogeneity.

Model adequacy checking

Plot of the residuals vs. material type, to check variance homogeneity.
Plot of the residuals vs. temperature, to check variance homogeneity.

(possible outliers)

Repeated measures ANOVA


Definition
Equivalent of the one-way ANOVA, but for related, non-independent groups; it is the extension of the paired samples t-test. It is also called within-subjects ANOVA, or ANOVA for correlated samples.
Test to detect any overall differences between related means.
It requires a categorical independent variable (either nominal or ordinal) and a continuous dependent variable (either interval or ratio).

When to use
Studies that investigate changes in mean scores over three or more time points, or differences in mean scores under three or more treatments (the same subjects are measured more than once on the same dependent variable).

Example  Investigating the effects of a 6-month training programme on blood pressure. Blood pressure is measured on the same subjects at three time instants: pre-, midway and post-exercise intervention.
(time is the within-subject factor)

Between-subjects ANOVA
The total variability is partitioned into between-groups variability and within-groups variability.
Within-subjects ANOVA
The within-groups variability is further partitioned, making the error term smaller.

Approach
Each subject is treated as a block, i.e., each subject becomes a level of a factor called "subjects".
The variability of the within-subjects factor can be computed exactly as we do with any between-subjects factor.
The error variability accounts only for the individual variability in the response to each condition.

More powerful test?

The number of degrees of freedom of the error variability decreases from N − k (N = nk observations; n: number of subjects, k: number of treatments — in effect, there are more subjects in the independent ANOVA design) to (n − 1)(k − 1).

RBCD (repeated measures):  F(2, 10) = 12.53, p = .002

Independent ANOVA:  F(2, 15) = 1.504, p = .254

It is usual for SS_subjects to account for such a large percentage of the within-groups variability that the reduction in the error term is large enough to more than compensate for the loss in the degrees of freedom.

Eta-squared:  η² = 0.751

Underlying assumptions
Normality
Each level of the dependent variable needs to be approximately normally distributed.

Sphericity
The concept of sphericity is the repeated-measures equivalent of homogeneity of variances.
Sphericity is the condition where the variances of the differences between all combinations of related groups (levels) are equal.
Violation of sphericity causes the test to become too liberal (increase of the Type I error rate).
Testing for sphericity is usually done using Mauchly's test.

Mauchly's test is weak, since it fails to detect departures from sphericity in small samples and over-detects them in large samples.

The epsilon statistics indicate the extent of the sphericity violation, and are used to reduce the number of degrees of freedom accordingly, such that higher critical values are used in the F-test to avoid increasing the Type I error rate.
