Introduction to Statistics and Data Analysis
Angelo Maria Sabatini
Preliminary considerations
Statistics is about understanding the role that variability plays in drawing conclusions based on data (numbers in context).
DEFINITION
Descriptive statistics: Branch of statistics that includes methods for organizing and summarizing data.
Inferential statistics: Branch of statistics that involves generalizing from a sample to the population from which the sample was selected, and assessing the reliability of such generalizations.
DEFINITION
Bar chart: Graph of the (relative) frequency distribution of categorical data. Each category is represented by a bar or rectangle, and the area of each bar is proportional to the corresponding frequency or relative frequency.
Source: Motorcycle Helmet Use in 2005, Overall Results. National Highway Traffic Safety Administration, August 2005.
Data were collected in June of 2005 by observing 1700 motorcyclists nationwide at selected roadway locations. Each time a motorcyclist passed by, the observer noted whether the rider was wearing no helmet, a noncompliant helmet, or a compliant helmet.
DEFINITION
Dot plots: Simple way to display numerical data when the dataset is reasonably small. Each observation is represented by a dot above the location corresponding to its value on a horizontal measurement scale. When a value occurs more than once, there is a dot for each occurrence, and these dots are stacked vertically.
Source: Keeping Score When It Counts: Graduation Rates and Academic Progress Rates for 2009 NCAA Men's Division I Basketball Tournament Teams. The Institute for Diversity and Ethics in Sport, University of Central Florida, March 2009.
The graduation rates of basketball players were compared to those of all student athletes for the universities and colleges that sent teams to the 2009 Division I playoffs. The graduation rates represented the percentage of athletes who started college in 2002 and had graduated by the end of 2008.
Experiment: The investigator observes how a response variable behaves when one or more explanatory variables, also called factors, are manipulated. The composition of the groups exposed to the different experimental conditions is determined by random assignment.
DEFINITION
Sampling: Procedure to select a representative sample of the population.
[Diagram: sampling goes from the population to the sample; inference goes from the sample back to the population.]
Bias
Selection bias (undercoverage): The way the sample is selected automatically excludes some parts of the population, e.g., interviewing landline phone users only.
Nonresponse bias: It occurs when responses are not obtained for all individuals selected for inclusion in the sample; the nonresponse rate for surveys or opinion polls can vary a lot, depending on how the data are collected.
Random sampling
Population: Set of objects (finite, countably infinite, or infinite), some attributes of which need to be investigated. The attributes have values that can be modeled as random vectors with distribution F (known or unknown).
DEFINITION
Replacement: Sampling may be carried out with or without replacement. When the sample size is small relative to the population size (say, below 10% of it), the differences between the two sampling methods are small.
DEFINITION
Response variables: Variables that are measured as part of the experiment.
Extraneous variables: Although not included in the set of explanatory variables, they are thought to affect the response variable.
Direct control: A researcher can directly control some extraneous variables; then any observed differences between groups could not be explained by them.
Blocking: The effects of some extraneous variables can be filtered out by a process known as blocking. Blocking creates groups (called blocks) that are similar with respect to the blocking variables; then all treatments are tried in each block.
Replication: Replication is the design strategy of making multiple observations for each experimental condition.
When an experiment can be viewed as a sequence of trials, random assignment involves the random assignment of treatments to trials. Random assignment, either of subjects to treatments or of treatments to trials, is a critical component of a good experiment.
Measuring Hardness
Hardness is indicated in a variety of ways, as indicated by the names of the
tests that follow:
Static indentation tests: A ball, cone, or pyramid is forced into the surface of the metal being tested. The relationship of load to the area or
depth of indentation is the measure of hardness, such as in Brinell,
Knoop, Rockwell, and Vickers hardness tests.
Rebound tests: An object of standard mass and dimensions is bounced
from the surface of the workpiece being tested, and the height of rebound
is the measure of hardness. The Scleroscope and Leeb tests are examples.
Scratch file tests: The idea is that one material is capable of scratching
another. The Mohs and file hardness tests are examples of this type.
Plowing tests: A blunt element (usually diamond) is moved across the
surface of the workpiece being tested under controlled conditions of load
Measuring Hardness
The tips are assigned to an experimental unit, that is, to a test specimen (called a coupon), which is a piece of metal on which the tip is tested.
Blocked experiment: Assign all four tips to the same test specimen, each randomly assigned to be tested on a different location on the specimen; since each treatment occurs once in each block, the number of test specimens is the number of replicates.
Pie charts: Useful to display data for a relatively small number of possible categories. They illustrate proportions of the whole dataset for the various categories.
A variant of the pie chart is the segmented bar graph (aka stacked bar graph), which uses a rectangular bar rather than a circle.
[Figure: histograms of the same data drawn with 10, 20, and 50 bins.]
$$\text{density} = \frac{\text{relative frequency}}{\text{bin width}}$$
Histogram shapes
General shape: Sometimes emphasized by using a smoothed histogram, i.e., a smooth curve approximating the histogram itself.
DEFINITION
Unimodal histogram: It has a single peak (a).
Bimodal histogram: It has two peaks (b).
Multimodal histogram: It has more than two peaks (c).
Histogram shapes
Tails and skewness: Proceeding to the right (left) from the peak of a unimodal histogram, we move into the upper (lower) tail of the histogram.
DEFINITION
Sample mean: Given observations $y_1, y_2, \ldots, y_n$,
$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n} = \frac{1}{n}\sum_{i=1}^{n} y_i$$
[Figure annotations: $\bar{y} = 23.10$, median $= 13$.]
While the median and the mean have the same value for symmetric distributions, the mean lies above (below) the median for right (left) skewed distributions.
Sample variance: Given observations $y_1, y_2, \ldots, y_n$ with mean $\bar{y} = (y_1 + y_2 + \cdots + y_n)/n$,
$$s^2 = \frac{(y_1-\bar{y})^2 + (y_2-\bar{y})^2 + \cdots + (y_n-\bar{y})^2}{n-1}$$
DEFINITION
Boxplot: The box spans the interquartile range IQR = IQ75 - IQ25, with a line at the median (IQ50). Whiskers extend to the most extreme observation (e.g., $y_{min}$) that is not an outlier; observations farther than 1.5 IQR from the box are flagged as outliers.
DEFINITION
Z-score: The z-score corresponding to a particular value is
$$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$$
It tells how many standard deviations the value is from the mean. The process of subtracting the mean and then dividing by the standard deviation is referred to as standardization.
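As a small illustration, a minimal Python sketch (standard library only, with a made-up sample) that computes z-scores directly from the definition above:

    import math

    def z_scores(values):
        """Standardize values: subtract the mean, divide by the sample standard deviation."""
        n = len(values)
        mean = sum(values) / n
        # Sample standard deviation with the n-1 denominator, as defined above
        s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
        return [(v - mean) / s for v in values]

    data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
    print(z_scores(data))  # each entry: distance from the mean in SD units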
DEFINITION
Percentile: For any particular number $r$ between 0 and 100, the $r$-th percentile is a value such that $r$ percent of the observations in the dataset fall at or below that value.
[Figure: examples of scatterplots. (a)-(b) positive linear relationships; (c) negative linear relationship; (d) no relationship; (e) nonlinear relationship.]
$$z_x = \frac{x - \bar{x}}{s_x}, \qquad z_y = \frac{y - \bar{y}}{s_y}, \qquad r = \frac{1}{n-1}\sum_{i=1}^{n} z_{x_i} z_{y_i}$$
Properties
1. The r-value does not depend on the unit of measurement for either variable.
2. The r-value is between -1 and +1. A value near the upper (lower) limit indicates a strong positive (negative) relationship.
3. The upper (lower) limit occurs when all the points in a scatterplot of the data lie on a straight line with positive (negative) slope.
4. The r-value is a measure of the extent to which x and y are linearly related.
5. An r-value close to 0 does not rule out any strong relationship between x and y; there could still be a strong relationship that is not linear.
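A minimal Python sketch of the z-score formula for r given above (the data are hypothetical):

    import math

    def pearson_r(x, y):
        """Correlation coefficient as the average product of z-scores (n-1 denominator)."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sx = math.sqrt(sum((v - mx) ** 2 for v in x) / (n - 1))
        sy = math.sqrt(sum((v - my) ** 2 for v in y) / (n - 1))
        return sum((a - mx) / sx * (b - my) / sy for a, b in zip(x, y)) / (n - 1)

    x = [1, 2, 3, 4, 5]            # hypothetical data
    y = [2.1, 3.9, 6.2, 8.0, 9.8]  # roughly linear in x
    print(pearson_r(x, y))  # close to +1: strong positive linear relationship

Rescaling x or y (property 1) leaves the printed value unchanged, since z-scores are unit-free.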
Least-squares line:
$$\hat{y} = bx + a, \qquad (a,b) = \underset{a,b}{\mathrm{argmin}}\sum_{i=1}^{n}(y_i - bx_i - a)^2$$
For observations $y_1, y_2, \ldots, y_n$:
$$b = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad a = \bar{y} - b\bar{x}$$
Regression
$$\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = r(n-1)s_x s_y, \qquad \sum_{i=1}^{n}(x_i - \bar{x})^2 = (n-1)s_x^2 \;\Rightarrow\; b = r\frac{s_y}{s_x}$$
so the fitted line can be written
$$\hat{y} = \bar{y} + r\frac{s_y}{s_x}(x - \bar{x})$$
For $x = \bar{x} \pm t\,s_x$ the prediction is $\hat{y} = \bar{y} \pm r\,t\,s_y$, which is closer to $\bar{y}$ than $\bar{y} \pm t\,s_y$ since $|r| \le 1$ (regression toward the mean).
Residuals
A point whose x value differs greatly from the others in the data set may have exerted excessive influence in determining the fitted line (although it does not necessarily represent an outlier). One method for assessing the impact of such points on the fit is to delete them from the data set, compute the best-fit line again, and evaluate the extent to which the equation of the line has changed.
DEFINITION
Coefficient of determination: measures the proportion of variability in the y variable that can be explained by a linear relationship between x and y:
$$r^2 = 1 - \frac{SS_{Res}}{SS_T}$$
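The least-squares formulas and $r^2$ can be checked with a short Python sketch (the data below are hypothetical, not the slides' dataset):

    def least_squares(x, y):
        """Least-squares fit y = b*x + a, plus the coefficient of determination r^2."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        sxx = sum((xi - mx) ** 2 for xi in x)
        b = sxy / sxx
        a = my - b * mx
        ss_res = sum((yi - (b * xi + a)) ** 2 for xi, yi in zip(x, y))
        ss_tot = sum((yi - my) ** 2 for yi in y)
        return a, b, 1 - ss_res / ss_tot

    x = [1, 2, 3, 4, 5]
    y = [2.0, 4.1, 5.9, 8.2, 9.8]   # hypothetical, roughly 2x
    a, b, r2 = least_squares(x, y)
    print(a, b, r2)  # intercept near 0, slope near 2, r^2 near 1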
Probability
Use the ideas and methods of the theory of probability to tackle, in a systematic way, the study of uncertainty.
DEFINITION
Chance experiment: Experiment whose result is uncertain before it is performed (it cannot be predicted with certainty). It can be repeated many times under the same conditions; each single performance is called a trial.
DEFINITION
Sample space: The collection of all possible outcomes of a chance experiment.
Probability
DEFINITION
Events: Any collection of outcomes from the sample space of a chance experiment.
Simple event: Event consisting of exactly one outcome.
Example (two throws of a die): the simple events are all the ordered pairs $(1,1), (1,2), \ldots, (6,5), (6,6)$. The event consisting of all outcomes whose first throw yields 1 is $\{(1,1), (1,2), \ldots, (1,5), (1,6)\}$.
DEFINITION
Frequency argument:
$$\text{Probability of } E = \frac{\text{Number of times } E \text{ occurs}}{\text{Number of trials}}$$
Common understanding (e.g., coin tossing): an infinite number of tosses are performed in an identical manner, physically independent of each other (experimental difficulties).
Axioms
(1) $\Pr(A) \ge 0$ for any event $A$
(2) $\Pr(X) = 1$, where $X$ is the sample space
(3) $\Pr(A_1 \cup A_2) = \Pr(A_1) + \Pr(A_2)$ if $A_1 \cap A_2 = \emptyset$
Useful consequences (illustrated with Venn diagrams of $A$, $B$, $A \cap B$, $\bar{A} \cap B$):
$$\Pr(\bar{A}) = 1 - \Pr(A)$$
$$\Pr(A \cup B) = \Pr(A) + \Pr(\bar{A} \cap B)$$
$$\Pr(B) = \Pr(A \cap B) + \Pr(\bar{A} \cap B)$$
Conditional probability
The probability of an event A under the condition that the event B has occurred is called the conditional probability of A given B (or probability of A conditional on B), defined as:
$$\Pr(A|B) = \frac{\Pr(A \cap B)}{\Pr(B)}, \quad \text{if } \Pr(B) \ne 0$$
Example
Events: A = {the four aces}; B = {hearts cards}. Probability that the drawn card is hearts, given that it is an ace:
$$\Pr(B|A) = \frac{\text{number of favorable cases (hearts ace)}}{\text{number of possibilities (ace cards)}} = \frac{1/52}{4/52} = \frac{1}{4} = \frac{\Pr(A \cap B)}{\Pr(A)}$$
[Figure: partition of the sample space into events $B_1, B_2, \ldots, B_6$.]
Independent events
Two events A and B are said to be independent if the probability of occurrence of one event is not affected by the occurrence of the other event:
$$\Pr(A|B) = \frac{\Pr(A \cap B)}{\Pr(B)} = \Pr(A) \;\Leftrightarrow\; \Pr(A \cap B) = \Pr(A)\Pr(B) \;\Leftrightarrow\; \Pr(B|A) = \frac{\Pr(B \cap A)}{\Pr(A)} = \Pr(B)$$
The fact that the knowledge of A (B) does not change the state of knowledge about B (A) has been expressed in terms of conditional probabilities.
Independent events
If A depends on B, then B depends on A:
$$\Pr(A|B) = \frac{\Pr(A \cap B)}{\Pr(B)} = \frac{\Pr(B|A)\Pr(A)}{\Pr(B)}$$
Do not confuse causality and dependence: dependence is different from causality, where A causes B, but B does not cause A.
Independence simplifies the computation of joint probabilities (the joint probability becomes the product of the probabilities):
$$\Pr(A \cap B) = \Pr(A)\Pr(B) \quad \text{if independent}$$
52 cards
Sampling with replacement: replacing selected cards gives the same deck for each selection, so $\Pr(H_3) = 0.25$.
Sampling without replacement:
$$\Pr(H_3|H_1 \cap H_2) = \frac{\text{number of outcomes favorable to } H_3}{\text{number of outcomes in the sample space}} = \frac{11}{50} = 0.22$$
$$\Pr(H_3|\bar{H}_1 \cap \bar{H}_2) = \frac{13}{50} = 0.26$$
520 cards
Sampling with replacement: replacing selected cards gives the same deck for each selection, so $\Pr(H_3) = 0.25$.
Sampling without replacement:
$$\Pr(H_3|H_1 \cap H_2) = \frac{128}{518} = 0.247, \qquad \Pr(H_3|\bar{H}_1 \cap \bar{H}_2) = \frac{130}{518} = 0.251$$
With the larger deck the conditional probabilities stay much closer to 0.25: the dependence induced by sampling without replacement weakens.
Bayes rule
Since
$$\Pr(A|B)\Pr(B) = \Pr(A \cap B) = \Pr(B|A)\Pr(A)$$
we have the Bayes rule:
$$\Pr(A|B) = \frac{\Pr(B|A)\Pr(A)}{\Pr(B)}$$
For any partition of the sample space $\{A_i,\; i = 1,2,\ldots,n\}$:
$$\Pr(A_i|B) = \frac{\Pr(B|A_i)\Pr(A_i)}{\sum_{j=1}^{n}\Pr(B|A_j)\Pr(A_j)}$$
Bayes rule
A priori probability: probability of event $A_i$, without knowing whether event B has occurred: $\Pr(A_i)$.
A posteriori probability: probability of event $A_i$, knowing that event B has occurred:
$$\Pr(A_i|B) = \frac{\Pr(B|A_i)\Pr(A_i)}{\sum_{j=1}^{n}\Pr(B|A_j)\Pr(A_j)}$$
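A minimal Python sketch of the partition form of the Bayes rule (the priors and likelihoods below are made-up numbers for illustration):

    def posterior(priors, likelihoods):
        """Bayes rule over a partition: Pr(Ai|B) = Pr(B|Ai)Pr(Ai) / sum_j Pr(B|Aj)Pr(Aj)."""
        joint = [p * l for p, l in zip(priors, likelihoods)]
        pr_b = sum(joint)          # total probability of the evidence B
        return [j / pr_b for j in joint]

    # Hypothetical three-event partition: priors Pr(Ai) and likelihoods Pr(B|Ai)
    priors = [0.5, 0.3, 0.2]
    likelihoods = [0.1, 0.4, 0.8]
    print(posterior(priors, likelihoods))  # a posteriori probabilities, summing to 1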
Sample space $A = \{a_1, a_2, \ldots, a_N\}$ with singleton probabilities $p(a_i)$. For an event $B = \{a_{k_1}, a_{k_2}, \ldots, a_{k_M}\}$, $1 \le M \le N$:
$$\Pr(B) = \sum_{i=1}^{M} p(a_{k_i})$$
Probability space for two variables: with $A = \{a_1, \ldots, a_M\}$ and $B = \{b_1, \ldots, b_N\}$, the joint, marginal and conditional probabilities are
$$p_{AB} = \{\Pr(a_i, b_j),\; i = 1,\ldots,M;\; j = 1,\ldots,N\}, \qquad p_A = \{\Pr(a_i)\}, \quad p_B = \{\Pr(b_j)\}$$
$$p_{A|B} = \{\Pr(a_i|b_j)\}, \qquad p_{B|A} = \{\Pr(b_j|a_i)\}$$
Marginalization:
$$\Pr(a_i) = \sum_{j=1}^{N}\Pr(a_i, b_j), \qquad \Pr(b_j) = \sum_{i=1}^{M}\Pr(b_j|a_i)\Pr(a_i)$$
Definition of the conditionals:
$$\Pr(b_j|a_i) = \frac{\Pr(a_i, b_j)}{\Pr(a_i)}$$
(the matrices $p_{AB} = [\Pr(a_i,b_j)]$ and $p_{B|A} = [\Pr(b_j|a_i)]$ collect the joint and conditional values).
Noise and the transmission channel
A source emits symbols $a_1, \ldots, a_M$ with probabilities $\Pr(a_1), \ldots, \Pr(a_M)$; the transmitter sends them over a noisy transmission channel, and the receiver delivers symbols $b_1, \ldots, b_N$ with probabilities $\Pr(b_1), \ldots, \Pr(b_N)$ to the user. The channel is described by the matrix of conditional probabilities
$$p_{B|A} = \begin{pmatrix} \Pr(b_1|a_1) & \cdots & \Pr(b_N|a_1) \\ \vdots & & \vdots \\ \Pr(b_1|a_M) & \cdots & \Pr(b_N|a_M) \end{pmatrix}$$
The probability of correct transmission and the probability of error are
$$\Pr(C) = \sum_{i=1}^{M} \Pr(S_R = a_i \,|\, S_T = a_i)\Pr(S_T = a_i), \qquad \Pr(E) = 1 - \Pr(C)$$
System design example
Observation model: $\Pr(R_1|S_0) = 0.1$, $\Pr(R_0|S_1) = 0.05$. Source: $\Pr(S_0) = 0.882$, $\Pr(S_1) = 0.118$.
$$\Pr(R_0) = \Pr(R_0|S_0)\Pr(S_0) + \Pr(R_0|S_1)\Pr(S_1) = (1 - \Pr(R_1|S_0))\Pr(S_0) + \Pr(R_0|S_1)\Pr(S_1) = 0.8$$
$$\Pr(S_1|R_1) = \frac{\Pr(R_1|S_1)\Pr(S_1)}{\Pr(R_1)} = \frac{(1 - \Pr(R_0|S_1))\Pr(S_1)}{1 - \Pr(R_0)} = 0.559$$
Performance: $\Pr(S_0|R_0) = 0.993$, $\Pr(E) = 0.09$.
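The slide's numbers can be reproduced with a few lines of Python (all values taken from the example above):

    # Priors and channel transition probabilities from the slide
    pr_s0, pr_s1 = 0.882, 0.118
    pr_r1_given_s0 = 0.10   # false alarm
    pr_r0_given_s1 = 0.05   # miss

    pr_r0 = (1 - pr_r1_given_s0) * pr_s0 + pr_r0_given_s1 * pr_s1
    pr_r1 = 1 - pr_r0
    pr_s1_given_r1 = (1 - pr_r0_given_s1) * pr_s1 / pr_r1
    pr_s0_given_r0 = (1 - pr_r1_given_s0) * pr_s0 / pr_r0
    pr_error = pr_r1_given_s0 * pr_s0 + pr_r0_given_s1 * pr_s1

    print(pr_r0)           # about 0.80
    print(pr_s1_given_r1)  # about 0.559
    print(pr_s0_given_r0)  # about 0.993
    print(pr_error)        # about 0.09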
[Diagram: PROBABILITY reasons deductively from the population to the sample (experiment, trial, outcome, measurement); STATISTICS proceeds by inductive inference from the sample back to the population (hypothesis testing). Random variables connect the two views.]
Random variables
DEFINITION
Random variable: Numerical variable whose value depends on the outcome of a chance experiment. A random variable associates a numerical value with each outcome of a chance experiment.
A discrete random variable can assume only values in a collection of isolated points along the number line, typically obtained by counting.
A continuous random variable has possible values in an entire interval of the number line, typically obtained by measuring uncertain physical variables.
PDF and CDF properties:
$$p_X(x) \ge 0, \qquad \int_{-\infty}^{+\infty} p_X(x)\,dx = 1, \qquad F_X(x) = \int_{-\infty}^{x} p_X(v)\,dv$$
$$0 \le F_X(x) \le 1, \qquad F_X(-\infty) = 0, \qquad F_X(+\infty) = 1, \qquad F_X(x_1) \le F_X(x_2) \;\text{ for } x_1 < x_2$$
Fundamental theorem of the expectation:
$$\mu_X = E[X] = \int_{-\infty}^{+\infty} x\, p_X(x)\,dx$$
VARIANCE
$$\text{Discrete RV:}\quad \sigma_X^2 = E[(X-\mu_X)^2] = \sum_{i=1}^{N}(x_i - \mu_X)^2\, p_i$$
$$\text{Continuous RV:}\quad \sigma_X^2 = E[(X-\mu_X)^2] = \int_{-\infty}^{+\infty}(x - \mu_X)^2\, p_X(x)\,dx$$
Binomial distribution
The observations described using a binomial RV are obtained as outcomes of a chance experiment with the following conditions:
- There is a fixed number of trials, N
- Each trial results in one of only two possible outcomes, labeled success (S) or failure (F)
- Outcomes of different trials are independent
- The probability that a trial results in a success is the same for each trial
Binomial distribution
Assumptions and terminology: each repetition is called a trial; the number of trials, usually denoted N, is fixed in advance; the probability of success, usually denoted $\pi$, is the same for every trial; the outcome of any trial does not influence the outcome of any other trial.
$$X \sim \mathrm{Bin}(N, \pi): \quad \Pr(X = k \,|\, \pi) = \binom{N}{k}\pi^k(1-\pi)^{N-k}, \qquad \text{binomial coefficient } \binom{N}{k} = \frac{N!}{k!(N-k)!}$$
Binomial distribution
Let $\pi = \Pr(A)$ on a single trial. One typical problem is the computation of the probability that A occurs exactly k times out of N (independent) trials. For the event B = "A occurs exactly k times out of N trials in a given order":
$$\Pr(B) = \pi^k(1-\pi)^{N-k}$$
Binomial distribution: example. The probability of at least two sixes in five rolls of a fair die is
$$1 - \binom{5}{0}\left(\tfrac{1}{6}\right)^0\left(\tfrac{5}{6}\right)^5 - \binom{5}{1}\left(\tfrac{1}{6}\right)^1\left(\tfrac{5}{6}\right)^4 = 0.196$$
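A short Python check of this example, using the binomial pmf defined above (math.comb is the standard-library binomial coefficient):

    from math import comb

    def binom_pmf(k, n, p):
        """Pr(X = k) for X ~ Bin(n, p)."""
        return comb(n, k) * p ** k * (1 - p) ** (n - k)

    # Probability of at least two sixes in five rolls of a fair die
    p_at_least_two = 1 - binom_pmf(0, 5, 1 / 6) - binom_pmf(1, 5, 1 / 6)
    print(round(p_at_least_two, 3))  # 0.196, matching the example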
Mean: using the index shift $j = i - 1$ and the normalization of $\mathrm{Bin}(n-1, \pi)$,
$$\mu_X = E[X] = \sum_{i=0}^{n} i\binom{n}{i}\pi^i(1-\pi)^{n-i} = n\pi\sum_{j=0}^{n-1}\frac{(n-1)!}{j!(n-1-j)!}\pi^j(1-\pi)^{n-1-j} = n\pi$$
Second moment, with the same shift:
$$E[X^2] = \sum_{i=0}^{n} i^2\binom{n}{i}\pi^i(1-\pi)^{n-i} = n\pi\sum_{j=0}^{n-1}(j+1)\binom{n-1}{j}\pi^j(1-\pi)^{n-1-j} = n\pi\left[(n-1)\pi + 1\right]$$
Variance:
$$\sigma_X^2 = E[X^2] - E[X]^2 = n\pi(1-\pi)$$
For large n, the binomial probabilities are approximated by a normal density:
$$\binom{n}{k}\pi^k(1-\pi)^{n-k} \approx \frac{1}{\sqrt{2\pi\, n\pi(1-\pi)}}\exp\left(-\frac{(k-n\pi)^2}{2n\pi(1-\pi)}\right)$$
[Figure: binomial probability vs. number of successes for $\pi = 0.1$, with N = 30 and N = 300.]
Normal distribution
$$X \sim N(\mu, \sigma^2): \quad p_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
Standard normal distribution
$$X \sim N(\mu, \sigma^2) \;\Rightarrow\; Z = \frac{X-\mu}{\sigma} \sim N(0,1)$$
Error function
$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x \exp(-t^2)\,dt$$
Cumulative distribution function
$$\Phi(x) = \frac{1}{2} + \frac{1}{2}\,\mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right)$$
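Python's math module exposes erf, so $\Phi$ can be written exactly as above (a minimal sketch):

    from math import erf, exp, pi, sqrt

    def normal_pdf(x, mu, sigma):
        """Density of N(mu, sigma^2)."""
        return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

    def normal_cdf(x, mu=0.0, sigma=1.0):
        """Phi((x - mu)/sigma) written with the error function, as above."""
        return 0.5 + 0.5 * erf((x - mu) / (sigma * sqrt(2)))

    print(normal_cdf(1.96))           # about 0.975
    print(normal_cdf(130, 120, 10))   # standardization happens inside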
Normal approximation to the binomial
$$\Pr(k_1 \le k \le k_2) = \sum_{k=k_1}^{k_2}\binom{n}{k}p^k q^{n-k} \approx \Phi\!\left(\frac{k_2 - np}{\sqrt{npq}}\right) - \Phi\!\left(\frac{k_1 - np}{\sqrt{npq}}\right), \qquad npq \gg 1$$
Weak law of large numbers: suppose $k_1 = n(p-\epsilon)$ and $k_2 = n(p+\epsilon)$. Then
$$\Pr\left(\left|\frac{k}{n} - p\right| \le \epsilon\right) \approx 2\Phi\!\left(\epsilon\sqrt{\frac{n}{pq}}\right) - 1, \qquad \lim_{n\to\infty}\Pr\left(\left|\frac{k}{n} - p\right| \le \epsilon\right) = 1$$
That is, the probability that the relative frequency of occurrence of an event differs from the true probability of the event by more than any $\epsilon > 0$ vanishes as n grows.
Uniform distribution
$$X \sim U(a,b): \quad p_X(x) = \begin{cases}\dfrac{1}{b-a}, & a \le x \le b\\[4pt] 0, & \text{elsewhere}\end{cases}$$
Standard uniform distribution
$$X \sim U(a,b) \;\Rightarrow\; Z = \frac{X-a}{b-a} \sim U(0,1), \qquad \mu_Z = \frac{1}{2}, \quad \sigma_Z^2 = \frac{1}{12}$$
Normal probability plot
Sort the n observations from smallest to largest; compute the normal scores (plotting positions of the form $(i - 3/8)/(n + 1/4)$ mapped through the standard normal quantile function); plot the normal scores vs. the sorted observations. In case of normality, the plot would look like a straight line. An S-shaped pattern indicates heavier tails than a normal; an isolated point far from the line indicates an outlier.
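A minimal Python sketch of the procedure, standard library only. The sample is hypothetical; the plotting positions $(i - 3/8)/(n + 1/4)$ are one common convention (inferred from the "n + 1/4" fragment on the slide), and the quantile is inverted by bisection rather than a library call:

    from math import erf, sqrt

    def normal_cdf(x):
        return 0.5 + 0.5 * erf(x / sqrt(2))

    def normal_quantile(p, lo=-10.0, hi=10.0):
        """Invert the standard normal CDF by bisection (sufficient for plotting)."""
        for _ in range(60):
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if normal_cdf(mid) < p else (lo, mid)
        return (lo + hi) / 2

    data = sorted([2.3, 1.9, 2.7, 2.1, 3.5, 2.0, 2.4])  # hypothetical sample
    n = len(data)
    scores = [normal_quantile((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    for s, x in zip(scores, data):
        print(f"{s:+.2f}  {x}")  # near-linear (score, value) pairs suggest normality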
Normalizing transformations
What to do when data samples have a distinctly non-normal shape? Use a suitable nonlinear transformation applied to the data samples.
Example: AGT levels in blood and urine
Hypothesis: the levels of a substance called AGT in blood and urine can be useful to measure kidney function.
Study: measurements were taken in 40 adults with chronic kidney disease.
Data analysis: sample distribution of plasma and urine AGT levels. A logarithmic transformation is applied to the urinary AGT data, which are positively skewed (a long upper tail).
Running mean: Sequence of means, where each mean uses one more observation in its calculation than the mean directly before it in the sequence.
DEFINITION
Sampling distribution: Distribution of the point estimates based on samples of a fixed size from a certain population. It is useful to think of a particular point estimate as being drawn from such a distribution.
DEFINITION
Standard error of an estimate: The standard deviation associated with an estimate is called the standard error. It describes the typical error or uncertainty associated with the estimate.
Unbiasedness: $\mathrm{bias}_F(T_n) = 0$; asymptotic unbiasedness: $\lim_{n\to\infty}\mathrm{bias}_F(T_n) = 0$.
Efficiency
$$\mathrm{var}_F(T_n) = E_F[T_n(\mathbf{Y})^2] - \left(E_F[T_n(\mathbf{Y})]\right)^2$$
Given two unbiased estimators, the best estimator is the one showing a smaller dispersion of the estimates around the true value of the parameter of interest associated to F, i.e., the one with smaller variance.
Sample mean
$$T_n(\mathbf{Y}) = \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$$
Correctness (proof):
$$E_F[\bar{Y}] = E_F\!\left[\frac{1}{n}\sum_{i=1}^{n} Y_i\right] = \frac{1}{n}\sum_{i=1}^{n}E_F[Y_i] = \frac{1}{n}\,n\mu = \mu$$
Efficiency (proof):
$$\mathrm{var}(\bar{Y}) = E_F[(\bar{Y} - E_F[\bar{Y}])^2] = E_F[(\bar{Y}-\mu)^2] = \frac{1}{n^2}\sum_{i=1}^{n}\sigma^2 = \frac{\sigma^2}{n}$$
Sample variance
$$T_n(\mathbf{Y}) = S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2$$
Correctness: $\mathrm{bias}_F(S^2) = 0$. Proof: with $SS = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2$,
$$E_F[SS] = \sum_{i=1}^{n}(\sigma^2 + \mu^2) - n\left(\frac{\sigma^2}{n} + \mu^2\right) = (n-1)\sigma^2,$$
so that $E_F[S^2] = \frac{1}{n-1}E_F[SS] = \sigma^2$.
Sample variance
DEFINITION
Degrees of freedom: The number of degrees of freedom of a statistic computed using n observations is equal to the number of independent elements. Example: in $SS = \sum_{i=1}^{n}(Y_i - \bar{Y})^2$ the deviations satisfy $\sum_{i=1}^{n}(Y_i - \bar{Y}) = 0$, so only n-1 of the elements appearing in the definition of SS are independent: SS has n-1 degrees of freedom.
$$\text{sample mean:}\quad \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}, \qquad \text{standard error:}\quad \hat{\sigma}_{\bar{x}} = \sqrt{\frac{1}{n(n-1)}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
$$\text{with replacement:}\quad \mathrm{std}_F(\bar{x}) = \sigma\frac{1}{\sqrt{n}}, \qquad \text{without replacement:}\quad \mathrm{std}_F(\bar{x}) = \sigma\frac{1}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$$
The estimate error is small when the population size N is much larger than the sample size n, e.g., N > 10 n.
General properties
Population with mean $\mu$ and standard deviation $\sigma$. Random sample (size n), with sample mean $\bar{x}$ and sample standard deviation $s$. The following rules hold for the sampling distribution of $\bar{X}$:
Rule 1: $\mu_{\bar{X}} = \mu$.
Rule 2: $\sigma_{\bar{X}} = \sigma/\sqrt{n}$.
Rule 3: When the population distribution is normal, the sampling distribution of the sample mean is also normal for any sample size n.
Comment: some further statistical tool is needed for small n.
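Rules 1 and 2 are easy to see in a simulation; a minimal Python sketch with an assumed normal population (mu = 50, sigma = 10, both made-up):

    import random
    import statistics

    random.seed(42)
    population_mu, population_sigma = 50, 10
    n, n_samples = 25, 2000

    # Draw many samples of size n and record each sample mean
    means = [statistics.fmean(random.gauss(population_mu, population_sigma) for _ in range(n))
             for _ in range(n_samples)]

    print(statistics.fmean(means))   # close to mu (Rule 1)
    print(statistics.stdev(means))   # close to sigma / sqrt(n) = 2 (Rule 2)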
The statistic that provides a basis for making inferences about p is the sample proportion of successes in a random sample of size n:
$$\hat{p} = \frac{\text{number of S's in the sample}}{n}$$
General properties
Population whose proportion of successes is p. Random sample (size n), with sample proportion $\hat{p}$ and sampling standard deviation $\sigma_{\hat{P}}$:
$$\mu_{\hat{P}} = p, \qquad \sigma_{\hat{P}} = \sqrt{\frac{p(1-p)}{n}}$$
Rule 3 (De Moivre-Laplace theorem): When n is large and p is not too near 0 or 1, the sampling distribution is approximately normal. Conservative rule of thumb: np > 10, n(1 - p) > 10.
DEFINITION
Confidence interval: Interval of plausible values for the characteristic of a population. It is constructed so that, with a chosen degree of confidence, the actual value of the characteristic will be captured inside the interval.
DEFINITION
Confidence level: The confidence level associated with a confidence interval estimate is the success rate of the method used to construct the interval.
Confidence interval: example
The standard error is the standard deviation associated with the estimate, $\sigma_{\hat{P}} = \sqrt{p(1-p)/n}$. Roughly 95% of the time the estimate will be within 2 standard errors of the true value (true for normal sampling distributions). [Figure: intervals from repeated samples; occasionally the true mean is not covered by a CI.]
Confidence interval (95% level) for a proportion:
$$\hat{p} - 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \;\le\; p \;\le\; \hat{p} + 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \qquad \text{i.e.,}\quad \hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
This interval will capture p (and this will happen for 95% of all possible samples). Don't say that the probability that p is in the interval is 0.95!
Standard deviation: $\sigma_{\bar{X}} = \sigma/\sqrt{n}$.
Confidence interval (95% level) for the mean:
$$\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} \;\le\; \mu \;\le\; \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}, \qquad \text{i.e.,}\quad \bar{x} \pm 1.96\frac{\sigma}{\sqrt{n}}$$
This interval will capture $\mu$ (and this will happen for 95% of all possible samples). Don't say that the probability that $\mu$ is in the interval is 0.95!
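A minimal Python sketch of the large-sample 95% interval for the mean (the data are hypothetical; for small n the 1.96 would be replaced by a t critical value, as discussed below):

    import math
    import statistics

    sample = [52.1, 48.3, 50.9, 47.5, 53.2, 49.8, 51.0, 50.4]  # hypothetical data
    n = len(sample)
    xbar = statistics.fmean(sample)
    s = statistics.stdev(sample)   # stands in for sigma in the large-sample case

    half_width = 1.96 * s / math.sqrt(n)
    print(xbar - half_width, xbar + half_width)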
$$E[S^2] = \frac{\sigma^2}{n-1}(n-1) = \sigma^2, \qquad \frac{SS}{\sigma^2} = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}, \qquad S^2 = \frac{\sigma^2}{n-1}\chi^2_{n-1}$$
$$\mathrm{var}[S^2] = \left(\frac{\sigma^2}{n-1}\right)^2 2(n-1) = \frac{2\sigma^4}{n-1}$$
2
Let
Z
be
a
standard
normal
RV
and
!
k
a
chi-square
RV
with
k
degrees
of
freedom,
we
have:
Z
Y =
Tk
2
k /k
!
Small-sample confidence interval construction
Let $Y_1, Y_2, \ldots, Y_n$ be a sequence of n i.i.d. random variables. Then
$$Y_i \sim N(\mu, \sigma^2) \;\Rightarrow\; Z = \frac{\bar{Y} - \mu}{S/\sqrt{n}} \sim T_{n-1}, \qquad E[Z] = 0, \quad \mathrm{var}(Z) = \frac{\nu}{\nu - 2}, \;\nu > 2$$
The resulting small-sample interval $\bar{x} \pm t_{n-1,\alpha/2}\,s/\sqrt{n}$ will capture $\mu$ (and this will happen for 95% of all possible samples when $\alpha = 0.05$). Don't say that the probability that $\mu$ is in the interval is 0.95! The t critical value exceeds the normal one, e.g., $n = 25$: $t_{24,0.025} = 2.06 > z_{0.025} = 1.96$.
F sampling distribution
F RV (with u and v degrees of freedom): The ratio of two independent chi-square RVs, each divided by its degrees of freedom, follows the F-distribution with u, v degrees of freedom:
$$Y = \frac{\chi^2_u / u}{\chi^2_v / v} \sim F_{u,v}$$
Hypothesis testing
Sample data can be used to decide whether some claim or hypothesis about a population characteristic is plausible. The two possible conclusions are then reject the null hypothesis or fail to reject the null hypothesis (lack of strong support against it).
Hypothesis testing: inference errors
Two kinds of errors can be committed when testing hypotheses:
$$\alpha = \Pr\{\text{type I error}\} = \Pr\{\text{reject } H_0 \,|\, H_0 \text{ is true}\}$$
$$\beta = \Pr\{\text{type II error}\} = \Pr\{\text{fail to reject } H_0 \,|\, H_0 \text{ is false}\}$$
$$\text{Power} = 1 - \beta = \Pr\{\text{reject } H_0 \,|\, H_0 \text{ is false}\}$$
General procedure
Specify a value of the probability of type I error, often called the significance level of the test, and then design the test procedure so that the probability of type II error has a suitably small value. After assessing the consequences of Type I and Type II errors, identify the largest $\alpha$ that is tolerable for the problem. Then employ a test procedure that uses this maximum acceptable value, rather than anything smaller, as the level of significance (because using a smaller $\alpha$ increases $\beta$).
Hypotheses:
$$H_0: \mu = \text{hypothesized value}, \qquad H_1: \mu \ne \text{hypothesized value}$$
Statistics:
$$z = \frac{\bar{x} - \text{hypothesized value}}{\sigma/\sqrt{n}} \;\; (\sigma \text{ known}), \qquad t = \frac{\bar{x} - \text{hypothesized value}}{s/\sqrt{n}} \;\; (\sigma \text{ unknown})$$
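A minimal Python sketch of the t statistic above (the measurements are hypothetical; the resulting value is then compared with a t critical value, not computed here):

    import math
    import statistics

    def one_sample_t(sample, hypothesized_mean):
        """t = (xbar - hypothesized value) / (s / sqrt(n)), as defined above."""
        n = len(sample)
        xbar = statistics.fmean(sample)
        s = statistics.stdev(sample)
        return (xbar - hypothesized_mean) / (s / math.sqrt(n))

    sample = [9.8, 10.4, 10.1, 9.6, 10.7, 10.2]   # hypothetical measurements
    t = one_sample_t(sample, 10.0)
    print(t)  # compare against the t critical value with n - 1 = 5 dof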
Hypothesis testing
Sample data can be used to decide whether some claim or hypothesis about a population characteristic is plausible. Hypothesis testing is a technique of statistical inference useful to compare two treatments, with knowledge of the risks associated with reaching the wrong conclusion.
Hypothesis testing: assumptions
$$Y_1 = \{y_{11}, y_{12}, \ldots, y_{1n}\}, \;\; Y_{1j} \sim N(\mu_1, \sigma_1^2); \qquad Y_2 = \{y_{21}, y_{22}, \ldots, y_{2n}\}, \;\; Y_{2j} \sim N(\mu_2, \sigma_2^2)$$
Statistical model:
$$y_{ij} = \mu_i + \epsilon_{ij}, \quad i = 1,2; \; j = 1,\ldots,n, \qquad \epsilon_{ij} \sim N(0, \sigma_i^2)$$
Hypothesis testing: statistical hypotheses
We are interested in comparing the means of the two formulations:
$$H_0: \mu_1 = \mu_2 \quad \text{(null hypothesis)}$$
$$H_1: \mu_1 \ne \mu_2 \quad \text{(two-sided alternative hypothesis)}$$
$$H_1: \mu_1 > \mu_2 \quad \text{or} \quad H_1: \mu_1 < \mu_2 \quad \text{(one-sided alternative hypotheses)}$$
Approach
Based on the available observations, compute the value of a test statistic, the sampling distribution of which is assumed known under $H_0$. Specify the set of values of the test statistic that leads to rejection of $H_0$ (critical region).
Two-sample t-test
Assumptions: the variances are unknown and identical for both formulations:
$$Y_{1j} \sim N(\mu_1, \sigma^2), \qquad Y_{2j} \sim N(\mu_2, \sigma^2)$$
Test statistic: since $\bar{Y}_1 - \bar{Y}_2 \sim N\!\left(\mu_1 - \mu_2,\; \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\right)$,
$$T_0 = \frac{\bar{Y}_1 - \bar{Y}_2}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \;\overset{H_0\ \text{true}}{\sim}\; T_{n_1+n_2-2}, \qquad S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$
Two-sample t-test (unequal variances)
Assumptions: the variances are unknown and different for the two formulations:
$$Y_{1j} \sim N(\mu_1, \sigma_1^2), \qquad Y_{2j} \sim N(\mu_2, \sigma_2^2)$$
Test statistic: since $\bar{Y}_1 - \bar{Y}_2 \sim N\!\left(\mu_1 - \mu_2,\; \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right)$,
$$T_0 = \frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} \;\overset{H_0\ \text{true}}{\approx}\; T_\nu, \qquad \nu = \frac{\left(\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}\right)^2}{\frac{1}{n_1-1}\left(\frac{S_1^2}{n_1}\right)^2 + \frac{1}{n_2-1}\left(\frac{S_2^2}{n_2}\right)^2}$$
Example: bio-equivalence
When one drug is being tested to replace another, it is important to check that the new drug has the same effects on the body as the old drug. Suppose Drug A is being used to lower blood pressure. We want to test Drug B, a cheaper generic drug, to verify whether it has the same effect on blood pressure.
Example: two-sample t-test
$$\bar{y}_1 = 130, \; s_1 = 12.6, \; n_1 = 144; \qquad \bar{y}_2 = 123.5, \; s_2 = 13.5, \; n_2 = 16$$
$$s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2} \;\Rightarrow\; s_p = 12.69, \qquad t_0 = \frac{\bar{y}_1 - \bar{y}_2}{s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} = 1.94$$
To determine whether to reject $H_0$, we compare $t_0$ to the t-Student distribution with $n_1 + n_2 - 2$ degrees of freedom: $t_{0.025,158} = 1.97$. Since $t_0 < t_{\alpha/2,\,n_1+n_2-2}$, $H_0$ is not rejected at the 5% significance level.
Example: two-sample t-test (larger second sample)
$$\bar{y}_1 = 130, \; s_1 = 12.6, \; n_1 = 144; \qquad \bar{y}_2 = 123.5, \; s_2 = 13.5, \; n_2 = 64$$
$$s_p = 12.88, \qquad t_0 = \frac{\bar{y}_1 - \bar{y}_2}{s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} = 3.36$$
Comparing $t_0$ to the t-Student distribution with $n_1 + n_2 - 2 = 206$ degrees of freedom, $t_{0.025,206} \approx 1.97$. Since $t_0 > t_{\alpha/2,\,n_1+n_2-2}$, $H_0$ is rejected at the 5% significance level.
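Both examples can be reproduced from their summary statistics with a short Python sketch:

    import math

    def pooled_two_sample_t(m1, s1, n1, m2, s2, n2):
        """Pooled-variance two-sample t statistic with n1 + n2 - 2 dof."""
        sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
        return (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

    # Summary statistics from the two examples above
    print(pooled_two_sample_t(130, 12.6, 144, 123.5, 13.5, 16))  # about 1.94: do not reject
    print(pooled_two_sample_t(130, 12.6, 144, 123.5, 13.5, 64))  # about 3.36: reject at 5%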
Effect size (known standard deviation):
$$d = \frac{|\mu_1 - \mu_2|}{2\sigma} = \frac{|\delta|}{2\sigma}$$
Larger effect sizes can be detected with smaller samples. [Slide annotations: d = 0.574, d = 0.748, d = 1.42; strength: small, medium, large.]
Testing hypotheses about the proportion of the population that falls into each of the possible categories:
$$H_0: p_1 = \text{hypothesized proportion for Category 1}, \;\; p_2 = \text{hypothesized proportion for Category 2}, \;\ldots,\; p_k = \text{hypothesized proportion for Category } k$$
$$H_1: \text{at least one of the true category proportions differs from the corresponding hypothesized value}$$
Goodness-of-fit
This statistic is a quantitative measure of the extent to which the observed counts differ from those expected when $H_0$ is true. For a sample of size n:
$$X^2 = \sum_{i=1}^{k}\frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}}$$
$$\text{expected cell count} = n \times \text{hypothesized value of corresponding population proportion}$$
The p-value is computed as the probability of observing a value of $X^2$ at least as large as the observed value when $H_0$ is true.
Example: top five US states for sales of hybrid cars in 2004. We want to test the hypothesis that hybrid sales for these five states are proportional to the population of these states.

State        Observed count   Population proportion   Expected count          Contribution to X^2
California   250              0.495                   406(0.495) = 200.970    11.9617
Virginia      56              0.103                   406(0.103) =  41.818     4.8096
Washington    34              0.085                   406(0.085) =  34.510     0.0075
Florida       33              0.240                   406(0.240) =  97.440    42.6161
Maryland      33              0.077                   406(0.077) =  31.362     0.0966
Total        406                                                              59.4916

Sample contribution: $(250 - 200.97)^2 / 200.97 = 11.9617$. Comparing $X^2 = 59.4916$ with the chi-square distribution with 4 dof, the upper-tail (shaded) area is 3.7e-12. The highest contributions to $X^2$ are from California, whose sales are higher than expected, and from Florida, whose sales are lower than expected.
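A short Python sketch reproducing the $X^2$ computation from the table (observed counts and proportions exactly as above):

    observed = [250, 56, 34, 33, 33]
    proportions = [0.495, 0.103, 0.085, 0.240, 0.077]
    n = sum(observed)

    expected = [n * p for p in proportions]
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    print(x2)  # about 59.49, to be compared with a chi-square with k - 1 = 4 dof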
Marginal totals are obtained by adding the observed cell counts in each row and in each column. The data are from independently chosen random samples or from subjects who were assigned at random to treatment groups. Assumption of large sample size: all expected counts are at least 5. If some expected counts are less than 5, rows or columns of the table may be combined to achieve a table with expected counts at least 5.
Hypothesis testing
Testing for homogeneity: are the category proportions the same for all the populations or treatments? (classification according to a single categorical variable)
Testing for independence of two categorical variables: association between two categorical variables in a single population is looked for (classification according to two categorical variables)
Test of homogeneity
$H_0$: The true category proportions are the same for all the populations or treatments (homogeneity of populations or treatments).
$H_1$: The true category proportions are not all the same for all of the populations or treatments.
Test statistic:
$$X^2 = \sum_{\text{all cells}}\frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}}, \qquad \text{expected cell count} = \frac{(\text{row marginal})(\text{column marginal})}{\text{grand total}}$$
p-values: when $H_0$ is true and the assumptions of the $X^2$ test are satisfied,
$$X^2 \sim \chi^2_{(\text{number of rows}-1)(\text{number of columns}-1)}$$
The p-value associated with the computed test statistic value is the area to the right of $X^2$ under the chi-square curve with the appropriate dof.
Test of independence: expected counts. Under $H_0$, the proportion of individuals in a particular category combination equals (proportion in specified category of first variable) x (proportion in specified category of second variable). Hence:
expected cell count = sample size x (observed number in category of first variable / sample size) x (observed number in category of second variable / sample size)
= (observed number in category of first variable) x (observed number in category of second variable) / grand total
Example

           Dog   Cat   Marginal total
Male        42    10    52
Female       9    39    48
Marginal    51    49   100

Expected counts in parentheses: Male-Dog 42 (26.52), Male-Cat 10 (25.48), Female-Dog 9 (24.48), Female-Cat 39 (23.52).
$$X^2 = \frac{(42-26.52)^2}{26.52} + \frac{(10-25.48)^2}{25.48} + \frac{(9-24.48)^2}{24.48} + \frac{(39-23.52)^2}{23.52} = 9.0358 + 9.4046 + 9.7888 + 10.1884 = 38.4176$$
p-value = 1 - cdf(chi2, 38.4176, 1) = 5.7115e-10
Example

           Dog   Cat   Marginal total
Male        27    25    52
Female      24    24    48
Marginal    51    49   100

Expected counts in parentheses: Male-Dog 27 (26.52), Male-Cat 25 (25.48), Female-Dog 24 (24.48), Female-Cat 24 (23.52).
$$X^2 = \frac{(27-26.52)^2}{26.52} + \frac{(25-25.48)^2}{25.48} + \frac{(24-24.48)^2}{24.48} + \frac{(24-23.52)^2}{23.52} = 0.0087 + 0.0090 + 0.0094 + 0.0098 = 0.0369$$
p-value = 1 - cdf(chi2, 0.0369, 1) = 0.8477
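A minimal Python sketch reproducing both 2x2 tables above:

    def chi_square_independence(table):
        """X^2 for a two-way table; expected = row total * column total / grand total."""
        rows = [sum(r) for r in table]
        cols = [sum(c) for c in zip(*table)]
        grand = sum(rows)
        x2 = 0.0
        for i, r in enumerate(table):
            for j, o in enumerate(r):
                e = rows[i] * cols[j] / grand
                x2 += (o - e) ** 2 / e
        return x2

    print(chi_square_independence([[42, 10], [9, 39]]))   # about 38.42: clear association
    print(chi_square_independence([[27, 25], [24, 24]]))  # about 0.04: no evidence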
Calculation: $X^2 = 20.6$, p-value = 1 - cdf(chi2, 20.6, 4) = 3.8e-4.
Interpretation: strong evidence to support the claim that the proportions in the number-of-concussions categories are not the same for the three groups compared.
Simple linear regression model: $Y \sim N(\alpha + \beta x, \sigma^2)$.
$$b = \text{point estimate of } \beta = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad a = \text{point estimate of } \alpha = \bar{y} - b\bar{x}$$
Let $x^*$ denote a specified value of the predictor variable x. Then $a + bx^*$ has two different interpretations: as a point estimate of the mean y-value when $x = x^*$, and as a point prediction of an individual y-value to be observed when $x = x^*$.
$$SS_{Resid} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad s_e = \sqrt{\frac{SS_{Resid}}{n-2}}, \qquad \sigma_b = \frac{\sigma}{\sqrt{S_{xx}}}$$
$\sigma_b$ is small when the x-values in the sample are spread out rather than close together, and when little variability exists about the population line.
$$t = \frac{b}{s_b} \sim T_{n-2} \;\text{ under } H_0: \beta = 0, \qquad s_b = \frac{s_e}{\sqrt{S_{xx}}}, \qquad \text{confidence interval: } b \pm (t \text{ critical value})\, s_b$$
Hypothesis: the treadmill run time predicts the time to run a 20-km ski race in elite biathletes.
Study: measurements were taken in 11 US elite biathletes.
Interpretation: we are 95% confident that the true average decrease in ski time associated with a 1-minute increase in treadmill time is between 1 and 3.7 minutes.
$$\text{standardized residual} = \frac{\text{residual}}{\text{estimated residual standard deviation}}, \qquad \text{estimated residual std} = s_e\sqrt{1 - \frac{1}{n} - \frac{(x_i-\bar{x})^2}{S_{xx}}}$$
The prediction interval and the confidence interval are centered at the same place. The addition of $s_e^2$ under the square-root symbol makes the prediction interval wider than the confidence interval.
Joint probability density function for a multivariate normal distribution:
$$p(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}\sqrt{\det(\Sigma)}}\exp\left(-\frac{1}{2}(\mathbf{x}-\mu)^T\Sigma^{-1}(\mathbf{x}-\mu)\right)$$
Bivariate case, with covariance matrix $\Sigma$ (symmetric, positive definite):
$$\mathbf{x} = \begin{pmatrix}x_1\\x_2\end{pmatrix}, \quad \mu = \begin{pmatrix}\mu_1\\\mu_2\end{pmatrix}, \quad \Sigma = \begin{pmatrix}\sigma_{11} & \sigma_{12}\\ \sigma_{21} & \sigma_{22}\end{pmatrix}, \qquad p(\mathbf{x}) = \frac{1}{2\pi\sqrt{\det(\Sigma)}}\exp\left(-\frac{1}{2}(\mathbf{x}-\mu)^T\Sigma^{-1}(\mathbf{x}-\mu)\right)$$
Spherical process:
$$\mu_1 = \begin{pmatrix}730\\1090\end{pmatrix}, \qquad \Sigma_1 = \begin{pmatrix}8000 & 0\\0 & 8000\end{pmatrix}$$
Diagonal covariance matrix:
$$\mu_2 = \begin{pmatrix}730\\1090\end{pmatrix}, \qquad \Sigma_2 = \begin{pmatrix}8000 & 0\\0 & 18500\end{pmatrix}$$
Full covariance matrix:
$$\mu_3 = \begin{pmatrix}730\\1090\end{pmatrix}, \qquad \Sigma_3 = \begin{pmatrix}8000 & 8400\\8400 & 18500\end{pmatrix}$$
Information sharing between the two axes!
Test of independence
Hypothesis testing: $H_0: \rho = 0$ vs. $H_1: \rho \ne 0$.
Statistic:
$$t = \frac{r}{\sqrt{\dfrac{1-r^2}{n-2}}} \sim T_{n-2}$$
Assumption: r is the correlation coefficient for a random sample from a bivariate normal population.
Caveat: it is necessary to verify bivariate normality of the sample (not easy, especially for small size).
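A minimal Python sketch of the statistic (r = 0.60 and n = 20 are made-up values for illustration):

    import math

    def correlation_t(r, n):
        """t = r / sqrt((1 - r^2) / (n - 2)), with n - 2 dof under H0: rho = 0."""
        return r / math.sqrt((1 - r ** 2) / (n - 2))

    print(correlation_t(0.60, 20))  # about 3.18; compare with the t critical value, 18 dof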
One-way ANOVA data layout: observations $y_{ij}$ ($i = 1,\ldots,a$ treatments, $j = 1,\ldots,n$ replicates), with treatment totals $y_{i.}$ and means $\bar{y}_{i.}$, and grand total $y_{..}$ with overall mean $\bar{y}_{..}$.
Model: $\mu_i = \mu + \tau_i$, $i = 1,\ldots,a$, with $\sum_{i=1}^{a}\tau_i = 0$.
$$H_0: \mu_1 = \mu_2 = \ldots = \mu_a, \qquad H_1: \mu_i \ne \mu_j \;\text{ for at least one pair } (i,j)$$
Assumptions: the total sum of squares decomposes as
$$SS_T = \sum_{i=1}^{a}\sum_{j=1}^{n}(y_{ij} - \bar{y}_{..})^2 = \sum_{i=1}^{a}\sum_{j=1}^{n}\left[(\bar{y}_{i.} - \bar{y}_{..}) + (y_{ij} - \bar{y}_{i.})\right]^2 = SS_{Treatments} + SS_E$$
with N-1, a-1 and N-a degrees of freedom respectively; within each treatment $\sum_{j=1}^{n}(y_{ij} - \bar{y}_{i.}) = 0$.
$$s_p^2 = \frac{SS_E}{N-a} \;\text{ estimates } \sigma^2, \qquad \frac{SS_{Treatments}}{a-1} = \frac{n\sum_{i=1}^{a}(\bar{y}_{i.} - \bar{y}_{..})^2}{a-1} \;\text{ estimates } \sigma^2 \text{ under } H_0$$
$$MS_E = \frac{SS_E}{N-a}, \qquad MS_{Treatments} = \frac{SS_{Treatments}}{a-1}, \qquad E_F[MS_E] = \sigma^2, \qquad E_F[MS_{Treatments}] = \sigma^2 + \frac{n\sum_{i=1}^{a}\tau_i^2}{a-1}$$
Under $H_0$, $SS_{Treatments}/\sigma^2 \sim \chi^2_{a-1}$ and $SS_E/\sigma^2 \sim \chi^2_{N-a}$, and $SS_{Treatments}$ and $SS_E$ are independent, so
$$f_0 = \frac{SS_{Treatments}/(a-1)}{SS_E/(N-a)} = \frac{MS_{Treatments}}{MS_E} \sim F(a-1, N-a)$$
$H_0$ is rejected when $f_0 > F_{\alpha,\,a-1,\,N-a}$.
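A minimal Python sketch of the balanced one-way ANOVA F computation (three hypothetical treatment groups):

    import statistics

    def one_way_anova_f(groups):
        """f0 = MS_treatments / MS_error for a groups of equal size n."""
        a = len(groups)
        n = len(groups[0])
        N = a * n
        grand = statistics.fmean(v for g in groups for v in g)
        means = [statistics.fmean(g) for g in groups]
        ss_treat = n * sum((m - grand) ** 2 for m in means)
        ss_err = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
        return (ss_treat / (a - 1)) / (ss_err / (N - a))

    # Three hypothetical treatment groups, five replicates each
    groups = [[18, 20, 19, 22, 21], [24, 26, 25, 27, 28], [19, 21, 20, 18, 22]]
    print(one_way_anova_f(groups))  # compare with the F critical value, (2, 12) dof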
Model fit: $y_{ij} = \mu + \tau_i + \epsilon_{ij}$, with estimates $\hat{\mu} = \bar{y}_{..}$ and $\hat{\tau}_i = \bar{y}_{i.} - \bar{y}_{..}$, $i = 1,\ldots,a$; moreover $\bar{y}_{i.} \sim N(\mu_i, \sigma^2/n)$.
Confidence intervals:
$$\bar{y}_{i.} - t_{\alpha/2,N-a}\sqrt{\frac{MS_E}{n}} \;\le\; \mu_i \;\le\; \bar{y}_{i.} + t_{\alpha/2,N-a}\sqrt{\frac{MS_E}{n}}$$
$$\bar{y}_{i.} - \bar{y}_{j.} - t_{\alpha/2,N-a}\sqrt{\frac{2MS_E}{n}} \;\le\; \mu_i - \mu_j \;\le\; \bar{y}_{i.} - \bar{y}_{j.} + t_{\alpha/2,N-a}\sqrt{\frac{2MS_E}{n}}$$
Bonferroni correction
To construct a set of simultaneously correct CIs, replace $\alpha/2$ with $\alpha/(2r)$ in the one-at-a-time confidence intervals. The method works nicely if r is not too large.
Normality assumption
Construct a normal probability plot of the residuals (problems with small samples). Moderate departure from normality does not necessarily imply a serious violation of the assumptions. The ANOVA is robust to the normality assumption (true significance level and power differ slightly from the stipulated values, with the power being lower).
$$d_{ij} = \frac{e_{ij}}{\sqrt{MS_E}} \approx N(0,1)$$
About 68% and 95% of the standardized residuals should fall within the limits $\pm 1$ and $\pm 2$, respectively.
Testing equality of variances (Bartlett's test):
$$H_0: \sigma_1^2 = \sigma_2^2 = \ldots = \sigma_a^2, \qquad H_1: \text{above not true for at least one } \sigma_i^2$$
$$\chi_0^2 = 2.3026\,\frac{q}{c} \sim \chi^2_{a-1}, \qquad q = (N-a)\log_{10}S_p^2 - \sum_{i=1}^{a}(n_i-1)\log_{10}S_i^2$$
$$c = 1 + \frac{1}{3(a-1)}\left(\sum_{i=1}^{a}(n_i-1)^{-1} - (N-a)^{-1}\right), \qquad S_p^2 = \frac{1}{N-a}\sum_{i=1}^{a}(n_i-1)S_i^2$$
$H_0$ is rejected when $\chi_0^2 > \chi^2_{\alpha,\,a-1}$.
In a normal probability plot, if all the data points fall near the line, an assumption of normality is reasonable. Otherwise, the points will curve away from the line, and an assumption of normality is not justified.
Contrasts
$$H_0: \mu_4 = \mu_5 \;\Leftrightarrow\; H_0: \mu_4 - \mu_5 = 0, \qquad H_1: \mu_4 \ne \mu_5$$
$$H_0: \mu_1 + \mu_2 = \mu_4 + \mu_5 \;\Leftrightarrow\; H_0: \mu_1 + \mu_2 - \mu_4 - \mu_5 = 0$$
In general, a contrast is $\Gamma = \sum_{i=1}^{a} c_i\mu_i$ with $\sum_{i=1}^{a} c_i = 0$:
$$H_0: \sum_{i=1}^{a}c_i\mu_i = 0, \qquad H_1: \sum_{i=1}^{a}c_i\mu_i \ne 0$$
With $C = \sum_{i=1}^{a} c_i\bar{y}_{i.}$ and $\mathrm{var}(C) = \frac{\sigma^2}{n}\sum_{i=1}^{a}c_i^2$, under $H_0$
$$t_0 = \frac{\sum_{i=1}^{a}c_i\bar{y}_{i.}}{\sqrt{\dfrac{MS_E}{n}\sum_{i=1}^{a}c_i^2}} \sim T_{N-a}, \qquad t_0^2 \sim F_{1,N-a}$$
(the coefficients may be standardized as $c_i^* = c_i / \sqrt{\frac{1}{n}\sum_{i=1}^{a}c_i^2}$). Two contrasts with coefficients $\{c_i\}$ and $\{d_i\}$ are orthogonal if $\sum_{i=1}^{a}c_i d_i = 0$.
For a treatments, a set of a-1 orthogonal contrasts partitions the sum of squares due to treatments into a-1 independent single-degree-of-freedom components. Thus tests performed on orthogonal contrasts are independent. Generally, the method of contrasts (or orthogonal contrasts) is useful for what are called preplanned comparisons; that is, the contrasts are specified prior to running the experiment and examining the data.
Scheffé's method: in many situations, experimenters may not know in advance which contrasts they wish to compare. For a preliminary exploration of the data, it may be useful to compare any and all possible contrasts between treatment means.
Scheffé's method: with m contrasts of interest,
$$\Gamma_u = \sum_{i=1}^{a}c_{iu}\mu_i, \quad u = 1,\ldots,m, \qquad S_{\alpha,u} = S_{C_u}\sqrt{(a-1)F_{\alpha,a-1,N-a}}$$
$H_0: \Gamma_u = 0$ is rejected when $|C_u| > S_{\alpha,u}$.
Simultaneous confidence intervals: the probability that they all are simultaneously true is $1-\alpha$:
$$C_u - S_{\alpha,u} \;\le\; \Gamma_u \;\le\; C_u + S_{\alpha,u}$$
Pairwise comparisons, $H_0: \mu_i = \mu_j$ for all $i \ne j$ vs. $H_1: \mu_i \ne \mu_j$:
Tukey's test; Least Significant Difference (LSD) method; Tukey's Honestly Significant Difference (HSD) method; Duncan's multiple range test; Dunnett's test; Newman-Keuls test.
Comments: Scheffé's method is conservative (avoid it unless the analysis of many contrasts was planned).
Each block contains all treatments and, within each block, the order in which the tips are tested is randomly determined. The only randomization of treatments is within the blocks, which therefore represent a restriction on randomization. Examples of blocking factors are batches of raw material, people, and time.
Randomized complete block design, hypothesis testing:
$$y_{ij} = \mu + \tau_i + \beta_j + \epsilon_{ij}, \quad i = 1,2,\ldots,a; \; j = 1,2,\ldots,b, \qquad \epsilon_{ij} \sim N(0,\sigma^2)$$
where $\tau_i$ is the effect of the i-th treatment ($\sum_{i=1}^{a}\tau_i = 0$) and $\beta_j$ is the effect of the j-th block ($\sum_{j=1}^{b}\beta_j = 0$).
$$H_0: \mu_1 = \mu_2 = \ldots = \mu_a \;\Leftrightarrow\; H_0: \tau_1 = \tau_2 = \ldots = \tau_a = 0, \qquad H_1: \text{at least one } \tau_i \ne 0$$
$$SS_T = \sum_{i=1}^{a}\sum_{j=1}^{b}(y_{ij} - \bar{y}_{..})^2, \;\; N-1 \text{ dof}; \qquad SS_{Treatments} = b\sum_{i=1}^{a}(\bar{y}_{i.} - \bar{y}_{..})^2, \;\; a-1 \text{ dof}$$
with $SS_T = SS_{Treatments} + SS_{Blocks} + SS_E$. Expected mean squares:
$$E[MS_{Treatments}] = \sigma^2 + \frac{b\sum_{i=1}^{a}\tau_i^2}{a-1}, \qquad E[MS_{Blocks}] = \sigma^2 + \frac{a\sum_{j=1}^{b}\beta_j^2}{b-1}, \qquad E[MS_E] = \sigma^2$$
$$F_0 = \frac{MS_{Treatments}}{MS_E} \sim F_{a-1,(a-1)(b-1)}, \qquad \frac{MS_{Blocks}}{MS_E} \sim F_{b-1,(a-1)(b-1)}$$
Be careful: the blocks represent a restriction on randomization, while randomization has been applied only within blocks. The block statistic would not be used to test equality of block means. It is only reasonable to look at it as an approximate procedure to investigate the effect of the blocking variable.
The randomized block design reduces the amount of noise in the data sufficiently for differences among the four tips to be detected.
Latin square design
Nuisance factors: batches of raw material and operators are arranged in a square. Treatments: the formulations are indicated by the Latin letters A, B, C, D, E. Blocking in two directions: two restrictions on randomization apply in this design.
$$y_{ijk} = \mu + \alpha_i + \tau_j + \beta_k + \epsilon_{ijk}, \quad i,j,k = 1,2,\ldots,p, \qquad \epsilon_{ijk} \sim N(0,\sigma^2)$$
with $\sum_{i=1}^{p}\alpha_i = 0$, $\sum_{j=1}^{p}\tau_j = 0$, $\sum_{k=1}^{p}\beta_k = 0$. The total $p^2 - 1$ degrees of freedom split into $p-1$ for rows, $p-1$ for treatments, $p-1$ for columns, and $p^2 - 1 - 3(p-1) = (p-2)(p-1)$ for error:
$$F_0 = \frac{MS_{Treatments}}{MS_E} \sim F_{p-1,(p-2)(p-1)}, \qquad \text{reject } H_0 \text{ if } F_0 > F_{\alpha,\,p-1,\,(p-2)(p-1)}$$
Between-subjects ANOVA
Introduction: it is used to test hypotheses about differences between two or more means.
Caveat: the t-test based on the standard error of the difference between two means can only be used to test differences between two means. When there are more than two means, it is possible to compare each mean with each other mean using t-tests, at the price of a severe inflation of the type I error rate.
Case study: factorial designs
Definition: in each complete trial or replication of the experiment, all possible combinations of the levels of the factors are investigated. For example, if there are a levels of factor A and b levels of factor B, each replicate contains all ab treatment combinations.
Interaction: it occurs when the difference in response between the levels of one factor is not the same at all levels of the other factors.
[Figure: response vs. factor A (low, high) at the low and high levels of factor B; parallel lines indicate no interaction, crossing lines indicate interaction.]
Design parameters: plate material of the battery (three possible choices); lab tests performed at each of three temperatures, consistent with the product end-use environment; four batteries tested at each combination of plate material and temperature.
Questions: effects of material type and temperature on the battery life; is there any material that would give uniformly long battery life regardless of temperature? (robust product design)
Statistical analysis, fixed effects model: $i = 1,\ldots,a$ levels of Factor A; $j = 1,\ldots,b$ levels of Factor B; $k = 1,\ldots,n$ replicates; with $\sum_{i=1}^{a}\tau_i = 0$, $\sum_{j=1}^{b}\beta_j = 0$, $\sum_{i=1}^{a}(\tau\beta)_{ij} = \sum_{j=1}^{b}(\tau\beta)_{ij} = 0$, and $\epsilon_{ijk} \sim N(0,\sigma^2)$.
Hypothesis testing:
Row treatment: $H_0: \tau_1 = \tau_2 = \ldots = \tau_a = 0$ vs. $H_1: \tau_i \ne 0$ for at least one i.
Column treatment: $H_0: \beta_1 = \beta_2 = \ldots = \beta_b = 0$ vs. $H_1: \beta_j \ne 0$ for at least one j.
Interaction: $H_0: (\tau\beta)_{ij} = 0$ for all i, j vs. $H_1: (\tau\beta)_{ij} \ne 0$ for at least one pair.
Statistical analysis: row, column, cell and grand totals and averages:
$$y_{i..} = \sum_{j=1}^{b}\sum_{k=1}^{n}y_{ijk}, \;\; \bar{y}_{i..} = \frac{y_{i..}}{bn}, \; i = 1,\ldots,a; \qquad y_{.j.} = \sum_{i=1}^{a}\sum_{k=1}^{n}y_{ijk}, \;\; \bar{y}_{.j.} = \frac{y_{.j.}}{an}, \; j = 1,\ldots,b$$
$$y_{ij.} = \sum_{k=1}^{n}y_{ijk}, \;\; \bar{y}_{ij.} = \frac{y_{ij.}}{n}; \qquad \bar{y}_{...} = \frac{y_{...}}{abn}$$
Statistical analysis: equation of the total corrected sum of squares:
$$SS_T = SS_A + SS_B + SS_{AB} + SS_E, \qquad SS_A = bn\sum_{i=1}^{a}(\bar{y}_{i..} - \bar{y}_{...})^2$$

Effect            Degrees of freedom
A                 a - 1
B                 b - 1
AB interaction    (a - 1)(b - 1)
Error             ab(n - 1)
Total             abn - 1
Statistical analysis: each sum of squares divided by its degrees of freedom is a mean square.
$$E[MS_A] = \sigma^2 + \frac{bn\sum_{i=1}^{a}\tau_i^2}{a-1}, \qquad E[MS_B] = \sigma^2 + \frac{an\sum_{j=1}^{b}\beta_j^2}{b-1}$$
$$E[MS_{AB}] = \sigma^2 + \frac{n\sum_{i=1}^{a}\sum_{j=1}^{b}(\tau\beta)_{ij}^2}{(a-1)(b-1)}, \qquad E[MS_E] = \sigma^2$$
$$\frac{MS_A}{MS_E} \sim F_{a-1,\,ab(n-1)}, \qquad \frac{MS_B}{MS_E} \sim F_{b-1,\,ab(n-1)}, \qquad \frac{MS_{AB}}{MS_E} \sim F_{(a-1)(b-1),\,ab(n-1)}$$
$$R^2 = \frac{SS_{Model}}{SS_T}$$
From intermediate to high temperature, batteries with material types 2 and 3 live shorter, whilst the battery with material type 1 does not change. Cell means at the intermediate temperature: $\bar{y}_{12.} = 57.25$, $\bar{y}_{22.} = 119.75$, $\bar{y}_{32.} = 145.75$; the slide quotes a comparison value of 45.47 computed from $MS_E$ and n, and flags possible outliers in the residuals.
When to use: studies that investigate changes in mean scores over three or more time points, or differences in mean scores under three or more treatments (the same subjects are being measured more than once on the same dependent variable).
Between-subjects ANOVA: the total variability is partitioned into between-groups variability and within-groups variability.
Within-subjects ANOVA: the within-groups variability is further partitioned, making the error term smaller.
Approach: each subject is treated as a block, i.e., each subject becomes a level of a factor called subjects. The variability of the within-subjects factor can be computed exactly as we do with any between-subjects factor. The error variability then accounts only for the individual variability within each condition.
RCBD: underlying assumptions
Normality: each level of the dependent variable needs to be approximately normally distributed.
Sphericity: the concept of sphericity is the repeated-measures equivalent of homogeneity of variances. Sphericity is the condition where the variances of the differences between all combinations of related groups (levels) are equal. Violation of sphericity causes the test to become too liberal (increase of the Type I error rate). Testing for sphericity is usually done using Mauchly's test.