
Probability and Statistics Cookbook

Copyright (c) Matthias Vallentin, 2011
vallentin@icir.org
12th December, 2011

This cookbook integrates a variety of topics in probability theory and statistics. It is based on literature [1, 6, 3] and in-class material from courses of the statistics department at the University of California, Berkeley, but is also influenced by other sources [4, 5]. If you find errors or have suggestions for further topics, I would appreciate it if you sent me an email. The most recent version of this document is available at http://matthias.vallentin.net/probability-and-statistics-cookbook/. To reproduce, please contact me.

Contents

1  Distribution Overview
   1.1  Discrete Distributions
   1.2  Continuous Distributions
2  Probability Theory
3  Random Variables
   3.1  Transformations
4  Expectation
5  Variance
6  Inequalities
7  Distribution Relationships
8  Probability and Moment Generating Functions
9  Multivariate Distributions
   9.1  Standard Bivariate Normal
   9.2  Bivariate Normal
   9.3  Multivariate Normal
10 Convergence
   10.1 Law of Large Numbers (LLN)
   10.2 Central Limit Theorem (CLT)
11 Statistical Inference
   11.1 Point Estimation
   11.2 Normal-Based Confidence Interval
   11.3 Empirical Distribution
   11.4 Statistical Functionals
12 Parametric Inference
   12.1 Method of Moments
   12.2 Maximum Likelihood
        12.2.1 Delta Method
   12.3 Multiparameter Models
        12.3.1 Multiparameter Delta Method
   12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Bayesian Inference
   14.1 Credible Intervals
   14.2 Function of Parameters
   14.3 Priors
        14.3.1 Conjugate Priors
   14.4 Bayesian Testing
15 Exponential Family
16 Sampling Methods
   16.1 The Bootstrap
        16.1.1 Bootstrap Confidence Intervals
   16.2 Rejection Sampling
   16.3 Importance Sampling
17 Decision Theory
   17.1 Risk
   17.2 Admissibility
   17.3 Bayes Rule
   17.4 Minimax Rules
18 Linear Regression
   18.1 Simple Linear Regression
   18.2 Prediction
   18.3 Multiple Regression
   18.4 Model Selection
19 Non-parametric Function Estimation
   19.1 Density Estimation
        19.1.1 Histograms
        19.1.2 Kernel Density Estimator (KDE)
   19.2 Non-parametric Regression
   19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
   20.1 Markov Chains
   20.2 Poisson Processes
21 Time Series
   21.1 Stationary Time Series
   21.2 Estimation of Correlation
   21.3 Non-Stationary Time Series
        21.3.1 Detrending
   21.4 ARIMA Models
        21.4.1 Causality and Invertibility
   21.5 Spectral Analysis
22 Math
   22.1 Gamma Function
   22.2 Beta Function
   22.3 Series
   22.4 Combinatorics
1 Distribution Overview

1.1 Discrete Distributions¹

Uniform (discrete), X ~ Unif{a, ..., b}
  F_X(x): 0 for x < a;  (⌊x⌋ − a + 1)/(b − a + 1) for a ≤ x ≤ b;  1 for x > b
  f_X(x) = I(a ≤ x ≤ b) / (b − a + 1)
  E[X] = (a + b)/2          V[X] = ((b − a + 1)² − 1)/12
  M_X(s) = (e^{as} − e^{(b+1)s}) / ((b − a + 1)(1 − e^s))

Bernoulli, X ~ Bern(p)
  F_X(x): 0 for x < 0;  1 − p for 0 ≤ x < 1;  1 for x ≥ 1
  f_X(x) = p^x (1 − p)^{1−x},  x ∈ {0, 1}
  E[X] = p                  V[X] = p(1 − p)
  M_X(s) = 1 − p + p e^s

Binomial, X ~ Bin(n, p)
  F_X(x) = I_{1−p}(n − x, x + 1)
  f_X(x) = C(n, x) p^x (1 − p)^{n−x}
  E[X] = np                 V[X] = np(1 − p)
  M_X(s) = (1 − p + p e^s)^n

Multinomial, X ~ Mult(n, p)
  f_X(x) = (n! / (x_1! ⋯ x_k!)) p_1^{x_1} ⋯ p_k^{x_k},  with Σ_{i=1}^k x_i = n
  E[X_i] = n p_i            V[X_i] = n p_i (1 − p_i)
  M_X(s) = (Σ_{i=1}^k p_i e^{s_i})^n

Hypergeometric, X ~ Hyp(N, m, n)
  f_X(x) = C(m, x) C(N − m, n − x) / C(N, n)
  E[X] = nm/N               V[X] = nm(N − n)(N − m) / (N²(N − 1))

Negative Binomial, X ~ NBin(r, p)
  F_X(x) = I_p(r, x + 1)
  f_X(x) = C(x + r − 1, r − 1) p^r (1 − p)^x
  E[X] = r(1 − p)/p         V[X] = r(1 − p)/p²
  M_X(s) = (p / (1 − (1 − p) e^s))^r

Geometric, X ~ Geo(p)
  F_X(x) = 1 − (1 − p)^x,  x ∈ N⁺
  f_X(x) = p (1 − p)^{x−1},  x ∈ N⁺
  E[X] = 1/p                V[X] = (1 − p)/p²
  M_X(s) = p e^s / (1 − (1 − p) e^s)

Poisson, X ~ Po(λ)
  F_X(x) = e^{−λ} Σ_{i=0}^{⌊x⌋} λ^i / i!
  f_X(x) = λ^x e^{−λ} / x!,  x ∈ {0, 1, 2, ...}
  E[X] = λ                  V[X] = λ
  M_X(s) = e^{λ(e^s − 1)}

[Figure: PMFs of the Uniform (discrete), Binomial, Geometric, and Poisson distributions for selected parameter values.]

¹ We use the notation γ(s, x) and Γ(x) to refer to the Gamma functions (see 22.1), and use B(x, y) and I_x to refer to the Beta functions (see 22.2).

1.2 Continuous Distributions

Uniform, X ~ Unif(a, b)
  F_X(x): 0 for x < a;  (x − a)/(b − a) for a < x < b;  1 for x > b
  f_X(x) = I(a < x < b) / (b − a)
  E[X] = (a + b)/2          V[X] = (b − a)²/12
  M_X(s) = (e^{sb} − e^{sa}) / (s(b − a))

Normal, X ~ N(μ, σ²)
  F_X(x) = Φ(x) = ∫_{−∞}^{x} φ(t) dt
  f_X(x) = φ(x) = (1/(σ√(2π))) exp(−(x − μ)² / (2σ²))
  E[X] = μ                  V[X] = σ²
  M_X(s) = exp(μs + σ²s²/2)

Log-Normal, X ~ ln N(μ, σ²)
  F_X(x) = 1/2 + (1/2) erf[(ln x − μ) / √(2σ²)]
  f_X(x) = (1/(x√(2πσ²))) exp(−(ln x − μ)² / (2σ²))
  E[X] = e^{μ + σ²/2}       V[X] = (e^{σ²} − 1) e^{2μ + σ²}

Multivariate Normal, X ~ MVN(μ, Σ)
  f_X(x) = (2π)^{−k/2} |Σ|^{−1/2} exp(−(1/2)(x − μ)ᵀ Σ^{−1} (x − μ))
  E[X] = μ                  V[X] = Σ
  M_X(s) = exp(μᵀs + (1/2) sᵀΣs)

Student's t, X ~ Student(ν)
  f_X(x) = (Γ((ν + 1)/2) / (√(νπ) Γ(ν/2))) (1 + x²/ν)^{−(ν+1)/2}
  E[X] = 0 (ν > 1)          V[X] = ν/(ν − 2) (ν > 2)

Chi-square, X ~ χ²_k
  F_X(x) = γ(k/2, x/2) / Γ(k/2)
  f_X(x) = (1 / (2^{k/2} Γ(k/2))) x^{k/2 − 1} e^{−x/2}
  E[X] = k                  V[X] = 2k
  M_X(s) = (1 − 2s)^{−k/2},  s < 1/2

F, X ~ F(d₁, d₂)
  F_X(x) = I_{d₁x/(d₁x + d₂)}(d₁/2, d₂/2)
  f_X(x) = √( (d₁x)^{d₁} d₂^{d₂} / (d₁x + d₂)^{d₁+d₂} ) / (x B(d₁/2, d₂/2))
  E[X] = d₂/(d₂ − 2) (d₂ > 2)
  V[X] = 2d₂²(d₁ + d₂ − 2) / (d₁(d₂ − 2)²(d₂ − 4)) (d₂ > 4)

Exponential, X ~ Exp(β)
  F_X(x) = 1 − e^{−x/β}
  f_X(x) = (1/β) e^{−x/β}
  E[X] = β                  V[X] = β²
  M_X(s) = 1/(1 − βs)  (s < 1/β)
  Memoryless property: P[X > x + y | X > y] = P[X > x]

Gamma, X ~ Gamma(α, β)
  F_X(x) = γ(α, x/β) / Γ(α)
  f_X(x) = (1/(Γ(α) β^α)) x^{α−1} e^{−x/β}
  E[X] = αβ                 V[X] = αβ²
  M_X(s) = (1 − βs)^{−α}  (s < 1/β)

Inverse Gamma, X ~ InvGamma(α, β)
  F_X(x) = Γ(α, β/x) / Γ(α)
  f_X(x) = (β^α / Γ(α)) x^{−α−1} e^{−β/x}
  E[X] = β/(α − 1) (α > 1)  V[X] = β² / ((α − 1)²(α − 2)) (α > 2)
  M_X(s) = (2(−βs)^{α/2} / Γ(α)) K_α(√(−4βs))

Dirichlet, X ~ Dir(α)
  f_X(x) = (Γ(Σ_{i=1}^k α_i) / ∏_{i=1}^k Γ(α_i)) ∏_{i=1}^k x_i^{α_i − 1}
  E[X_i] = α_i / Σ_j α_j    V[X_i] = E[X_i](1 − E[X_i]) / (Σ_j α_j + 1)

Beta, X ~ Beta(α, β)
  F_X(x) = I_x(α, β)
  f_X(x) = (Γ(α + β) / (Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1}
  E[X] = α/(α + β)          V[X] = αβ / ((α + β)²(α + β + 1))
  M_X(s) = 1 + Σ_{k=1}^{∞} (∏_{r=0}^{k−1} (α + r)/(α + β + r)) s^k/k!

Weibull, X ~ Weibull(λ, k)
  F_X(x) = 1 − e^{−(x/λ)^k}
  f_X(x) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k}
  E[X] = λΓ(1 + 1/k)        V[X] = λ²Γ(1 + 2/k) − (E[X])²
  M_X(s) = Σ_{n=0}^{∞} (s^n λ^n / n!) Γ(1 + n/k)

Pareto, X ~ Pareto(x_m, α)
  F_X(x) = 1 − (x_m/x)^α,  x ≥ x_m
  f_X(x) = α x_m^α / x^{α+1},  x ≥ x_m
  E[X] = α x_m/(α − 1) (α > 1)   V[X] = x_m² α / ((α − 1)²(α − 2)) (α > 2)
  M_X(s) = α(−x_m s)^α Γ(−α, −x_m s)  (s < 0)

[Figure: PDFs of the Uniform (continuous), Normal, Log-Normal, Student's t, χ², F, Exponential, Gamma, Inverse Gamma, Beta, Weibull, and Pareto distributions for selected parameter values.]
2 Probability Theory

Law of Total Probability

Definitions

P [B] =

Sample space
Outcome (point or element)
Event A
-algebra A

P [B|Ai ] P [Ai ]

n
G

Ai

i=1

Bayes Theorem
P [B | Ai ] P [Ai ]
P [Ai | B] = Pn
j=1 P [B | Aj ] P [Aj ]
Inclusion-Exclusion Principle
n

n
[ X
Ai =
(1)r1

Probability Distribution P

i=1

1. P [A] 0 A
2. P [] = 1
" #

G
X
3. P
Ai =
P [Ai ]

r=1

n
G

Ai

i=1


r

\

Aij

ii1 <<ir n j=1

3 Random Variables

Random Variable (RV)

i=1

X:R

Probability space (, A, P)
Probability Mass Function (PMF)

Properties

i=1

1. A
S
2. A1 , A2 , . . . , A = i=1 Ai A
3. A A = A A

i=1

n
X

P[∅] = 0
B = B ∩ Ω = B ∩ (A ∪ Aᶜ) = (A ∩ B) ∪ (Aᶜ ∩ B)
P[Aᶜ] = 1 − P[A]
P[B] = P[A ∩ B] + P[Aᶜ ∩ B]
P[Ω] = 1,  P[∅] = 0
(⋃ₙ Aₙ)ᶜ = ⋂ₙ Aₙᶜ  and  (⋂ₙ Aₙ)ᶜ = ⋃ₙ Aₙᶜ   (De Morgan)
P[⋃ₙ Aₙ] = 1 − P[⋂ₙ Aₙᶜ]
P[A ∪ B] = P[A] + P[B] − P[A ∩ B]
  ⟹ P[A ∪ B] ≤ P[A] + P[B]
P[A ∪ B] = P[A ∩ Bᶜ] + P[Aᶜ ∩ B] + P[A ∩ B]
P[A ∩ Bᶜ] = P[A] − P[A ∩ B]

fX (x) = P [X = x] = P [{ : X() = x}]


Probability Density Function (PDF)
Z
P [a X b] =

f (x) dx
a

Cumulative Distribution Function (CDF)


FX : R [0, 1]

FX (x) = P [X x]

1. Nondecreasing: x₁ < x₂ ⟹ F(x₁) ≤ F(x₂)
2. Normalized: lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1
3. Right-continuous: lim_{y↓x} F(y) = F(x)

Continuity of Probabilities
A1 A2 . . . = limn P [An ] = P [A]
A1 A2 . . . = limn P [An ] = P [A]

S
whereA = i=1 Ai
T
whereA = i=1 Ai

ab

Independence

fY |X (y | x) =

A
B P [A B] = P [A] P [B]

f (x, y)
fX (x)

Independence

Conditional Probability
P [A | B] =

fY |X (y | x)dy

P [a Y b | X = x] =

P [A B]
P [B]

P [B] > 0

1. P [X x, Y y] = P [X x] P [Y y]
2. fX,Y (x, y) = fX (x)fY (y)
6

3.1

Transformations

E [XY ] =

xyfX,Y (x, y) dFX (x) dFY (y)


X,Y

Transformation function

E [(Y )] 6= (E [X])
(cf. Jensen inequality)
P [X Y ] = 0 = E [X] E [Y ] P [X = Y ] = 1 = E [X] = E [Y ]

X
E [X] =
P [X x]

Z = (X)
Discrete
X



fZ (z) = P [(X) = z] = P [{x : (x) = z}] = P X 1 (z) =

f (x)

x1 (z)

x=1

Sample mean
n

X
n = 1
X
Xi
n i=1

Continuous
Z
FZ (z) = P [(X) z] =

with Az = {x : (x) z}

f (x) dx
Az

Special case if strictly monotone





dx

d
1
fZ (z) = fX (1 (z)) 1 (z) = fX (x) = fX (x)
dz
dz
|J|

Conditional expectation
Z
E [Y | X = x] = yf (y | x) dy
E [X] = E [E [X | Y ]]
E[(X, Y ) | X = x] =

(x, y)fY |X (y | x) dx
Z

The Rule of the Lazy Statistician


E [(Y, Z) | X = x] =

Z
E [Z] =

E [Y + Z | X] = E [Y | X] + E [Z | X]
E [(X)Y | X] = (X)E [Y | X]
E[Y | X] = c = Cov [X, Y ] = 0

Z
dFX (x) = P [X A]

IA (x) dFX (x) =

(y, z)f(Y,Z)|X (y, z | x) dy dz

(x) dFX (x)

Z
E [IA (x)] =

Convolution
Z
Z := X + Y

fX,Y (x, z x) dx

fZ (z) =

X,Y 0

fX,Y (x, z x) dx

Z := |X Y |
Z :=

X
Y

fZ (z) = 2
fX,Y (x, z + x) dx
0
Z
Z

fZ (z) =
|x|fX,Y (x, xz) dx =
xfx (x)fX (x)fY (xz) dx

4 Expectation

5 Variance

Definition and properties



 

2
2
V [X] = X
= E (X E [X])2 = E X 2 E [X]
" n
#
n
X
X
X
V
Xi =
V [Xi ] + 2
Cov [Xi , Yj ]
i=1

" n
X

#
Xi =

i=1

Definition and properties

Z
E [X] = X =

x dFX (x) =

i6=j

V [Xi ]

iff Xi
Xj

i=1

Standard deviation
X

xfX (x)

x
Z

xfX (x)

P [X = c] = 1 = E [c] = c
E [cX] = c E [X]
E [X + Y ] = E [X] + E [Y ]

i=1
n
X

sd[X] =

X discrete

V [X] = X

Covariance
X continuous

Cov [X, Y ] = E [(X E [X])(Y E [Y ])] = E [XY ] E [X] E [Y ]


Cov [X, a] = 0
Cov [X, X] = V [X]
Cov [X, Y ] = Cov [Y, X]
Cov [aX, bY ] = abCov [X, Y ]

Cov [X + a, Y + b] = Cov [X, Y ]

n
m
n X
m
X
X
X

Cov
Xi ,
Yj =
Cov [Xi , Yj ]
i=1

j=1

Correlation

limn Bin (n, p) = Po (np)


(n large, p small)
limn Bin (n, p) = N (np, np(1 p))
(n large, p far from 0 and 1)
Negative Binomial

i=1 j=1

Cov [X, Y ]
[X, Y ] = p
V [X] V [Y ]

Independence

X NBin (1, p) = Geo (p)


Pr
X NBin (r, p) = i=1 Geo (p)
P
P
Xi NBin (ri , p) =
Xi NBin ( ri , p)
X NBin (r, p) . Y Bin (s + r, p) = P [X s] = P [Y r]

Poisson

X
Y = [X, Y ] = 0 Cov [X, Y ] = 0 E [XY ] = E [X] E [Y ]

Xi Po (i ) Xi Xj =

n
X

Xi Po

i=1

Sample variance

n
X

!
i

i=1

X
n
X
n
i

Xj Bin
Xj , Pn
Xi Po (i ) Xi Xj = Xi
j=1
j=1 j
j=1

1 X
n )2
S2 =
(Xi X
n 1 i=1
Conditional variance




2
V [Y | X] = E (Y E [Y | X])2 | X = E Y 2 | X E [Y | X]
V [Y ] = E [V [Y | X]] + V [E [Y | X]]

Exponential
Xi Exp () Xi
Xj =

n
X

Xi Gamma (n, )

i=1

Memoryless property: P [X > x + y | X > y] = P [X > x]

6 Inequalities

Normal
X N , 2

Cauchy-Schwarz

   
E [XY ] E X 2 E Y 2

Markov
P [(X) t]

E [(X)]
t

Chebyshev
P [|X E [X]| t]
Chernoff


P [X (1 + )]

> 1

E [(X)] (E [X]) convex

7 Distribution Relationships

Binomial
Xi Bern (p) =

n
X

Gamma

Jensen

N (0, 1)


X N , Z = aX + b = Z N a + b, a2 2



X N 1 , 12 Y N 2 , 22 = X + Y N 1 + 2 , 12 + 22


P
P
P
2
Xi N i , i2 =
X N
i i ,
i i
i i


a
P [a < X b] = b

(x) = 1 (x)
0 (x) = x(x)
00 (x) = (x2 1)(x)
1
Upper quantile of N (0, 1): z = (1 )

V [X]
t2

e
(1 + )1+

Xi Bin (n, p)

i=1

X Bin (n, p) , Y Bin (m, p) = X + Y Bin (n + m, p)

X Gamma (, ) X/ Gamma (, 1)
P
Gamma (, ) i=1 Exp ()
P
P
Xi Gamma (i , ) Xi
Xj =
Xi Gamma ( i i , )
i
Z
()

=
x1 ex dx

0
Beta
1
( + ) 1

x1 (1 x)1 =
x
(1 x)1
B(, )
()()
  B( + k, )


+k1
E Xk =
=
E X k1
B(, )
++k1
Beta (1, 1) Unif (0, 1)
8

8 Probability and Moment Generating Functions


 
GX (t) = E tX

X
(Y E [Y ])
Y
p
V [X | Y ] = X 1 2

E [X | Y ] = E [X] +

|t| < 1


Xt

MX (t) = GX (e ) = E e

=E

"
#
X (Xt)i
i!

i=0

 

X
E Xi
=
ti
i!
i=0

P [X = 0] = GX (0)
P [X = 1] = G0X (0)
P [X = i] =

Conditional mean and variance

9.3

(i)
GX (0)

Multivariate Normal
(Precision matrix 1 )

V [X1 ]
Cov [X1 , Xk ]

..
..
..
=

.
.
.

Covariance matrix

i!
E [X] = G0X (1 )
 
(k)
E X k = MX (0)


X!
(k)
E
= GX (1 )
(X k)!

Cov [Xk , X1 ]
If X N (, ),

V [X] = G00X (1 ) + G0X (1 ) (G0X (1 ))


d

n/2

GX (t) = GY (t) = X = Y

9
9.1

V [Xk ]

fX (x) = (2)

||

1
exp (x )T 1 (x )
2

Properties

9 Multivariate Distributions

Standard Bivariate Normal

Let X, Y N (0, 1) X
Z where Y = X +

1/2

1 2 Z

Z N (0, 1) X = + 1/2 Z = X N (, )
X N (, ) = 1/2 (X ) N (0, 1)

X N (, ) = AX N A, AAT

X N (, ) kak = k = aT X N aT , aT a

Joint density
f (x, y) =


 2
1
x + y 2 2xy
p
exp
2(1 2 )
2 1 2

10 Convergence

Let {X1 , X2 , . . .} be a sequence of rvs and let X be another rv. Let Fn denote
the cdf of Xn and let F denote the cdf of X.

Conditionals
(Y | X = x) N x, 1 2

and

(X | Y = y) N y, 1 2


Types of convergence

Independence

1. In distribution (weakly, in law): Xn X

X
Y = 0

lim Fn (t) = F (t)

9.2

Bivariate Normal

2. In probability: Xn X



Let X N x , x2 and Y N y , y2 .
f (x, y) =
"
z=

x x
x

2x y

2


+

1
p

z
exp
2
2(1 2 )
1

y y
y

t where F continuous

2


2

x x
x



( > 0) lim P [|Xn X| > ] = 0

y y
y

n
as

#

3. Almost surely (strongly): Xn X


h
i
h
i
P lim Xn = X = P : lim Xn () = X() = 1
n

qm

4. In quadratic mean (L2 ): Xn X

CLT notations
Zn N (0, 1)


2

Xn N ,
n


2

Xn N 0,
n


2
n ) N 0,
n(X

n(Xn )
N (0, 1)
n



lim E (Xn X)2 = 0

Relationships
qm

Xn X = Xn X = Xn X
as
P
Xn X = Xn X
D
P
Xn X (c R) P [X = c] = 1 = Xn X

Xn
Xn
Xn
Xn

X
qm
X
P
X
P
X

Yn
Yn
Yn
=

Y = Xn + Yn X + Y
qm
qm
Y = Xn + Yn X + Y
P
P
Y = Xn Yn XY
P
(Xn ) (X)

Continuity correction


x + 12

/ n




x 12

P Xn x 1
/ n

Xn X = (Xn ) (X)
qm
Xn b limn E [Xn ] = b limn V [Xn ] = 0
qm
n
X1 , . . . , Xn iid E [X] = V [X] < X



n x
P X

Slutsky's Theorem
  Xₙ →D X and Yₙ →P c ⟹ Xₙ + Yₙ →D X + c
  Xₙ →D X and Yₙ →P c ⟹ XₙYₙ →D cX
  In general: Xₙ →D X and Yₙ →D Y does not imply Xₙ + Yₙ →D X + Y

10.1

Delta method

Yn N

11

Law of Large Numbers (LLN)

2
,
n


= (Yn ) N

2
(), ( ())
n
0

11 Statistical Inference
iid

Let {X1 , . . . , Xn } be a sequence of iid rvs, E [X1 ] = , and V [X1 ] < .

Let X1 , , Xn F if not otherwise noted.

Weak (WLLN)

11.1
n

Point estimator bn of is a rv: bn = g(X1 , . . . , Xn )


h i
bias(bn ) = E bn

as
n
X

P
Consistency: bn
Sampling distribution: F (bn )
r h i
b
Standard error: se(n ) = V bn
h
i
h i
Mean squared error: mse = E (bn )2 = bias(bn )2 + V bn

Strong (WLLN)

10.2

Central Limit Theorem (CLT)

Let {X1 , . . . , Xn } be a sequence of iid rvs, E [X1 ] = , and V [X1 ] = 2 .



n
n(Xn ) D
X
Z
Zn := q   =

n
V X
lim P [Zn z] = (z)

Point Estimation

n
X
P

where Z N (0, 1)

zR

limn bias(bn ) = 0 limn se(bn ) = 0 = bn is consistent


bn D
Asymptotic normality:
N (0, 1)
se
Slutsky's Theorem often lets us replace se(θ̂ₙ) by some (weakly) consistent estimator ŝeₙ.
10

11.2

Normal-Based Confidence Interval




b 2 . Let z/2
Suppose bn N , se


and P z/2 < Z < z/2 = 1 where Z N (0, 1). Then



= 1 (1 (/2)), i.e., P Z > z/2 = /2

b
Cn = bn z/2 se

11.4

Statistical Functionals
Statistical functional: T (F )
Plug-in estimator of = (F ): bn = T (Fbn )
R
Linear functional: T (F ) = (x) dFX (x)
Plug-in estimator for linear functional:
n

11.3

1X
(Xi )
(x) dFbn (x) =
n i=1


b 2 = T (Fbn ) z/2 se
b
Often: T (Fbn ) N T (F ), se
T (Fbn ) =

Empirical distribution

Empirical Distribution Function (ECDF)


Pn
Fbn (x) =

i=1

I(Xi x)
n

(
1
I(Xi x) =
0

Xi x
Xi > x

Properties (for any fixed x)


h i
E Fbn = F (x)
h i F (x)(1 F (x))
V Fbn =
n
F (x)(1 F (x)) D
mse =
0
n
P
Fbn F (x)
Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (X1 , . . . , Xn F )




2


P sup F (x) Fbn (x) > = 2e2n

pth quantile: F 1 (p) = inf{x : F (x) p}
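To make the empirical distribution function and the DKW band concrete, here is a minimal NumPy sketch (not part of the original cookbook; the simulated data and α = 0.05 are made up for illustration):

```python
# Sketch: empirical cdf and the DKW 1 - alpha confidence band
#   L(x) = max(F_hat - eps, 0),  U(x) = min(F_hat + eps, 1),
#   eps = sqrt(log(2 / alpha) / (2n)).
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.normal(size=200))      # simulated sample (assumption)
n = x.size

def ecdf(t):
    # F_hat(t) = (1/n) * #{X_i <= t}
    return np.searchsorted(x, t, side="right") / n

alpha = 0.05
eps = np.sqrt(np.log(2 / alpha) / (2 * n))

grid = np.linspace(x.min(), x.max(), 5)
F_hat = ecdf(grid)
lower = np.clip(F_hat - eps, 0, 1)
upper = np.clip(F_hat + eps, 0, 1)
print(np.c_[grid, lower, F_hat, upper])
```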


n

b=X
n
1 X
n )2

b2 =
(Xi X
n 1 i=1
Pn
1
b)3
i=1 (Xi
n

b=

b3 j
Pn
n )(Yi Yn )
(Xi X
qP
b = qP i=1
n
n
2

(X

X
)
i
n
i=1
i=1 (Yi Yn )

12 Parametric Inference



Let F = f (x; ) : be a parametric model with parameter space Rk
and parameter = (1 , . . . , k ).

12.1

Method of Moments

j th moment
 
j () = E X j =

Nonparametric 1 confidence band for F


L(x) = max{Fbn n , 0}
U (x) = min{Fbn + n , 1}
s
 
1
2
log
=
2n

P [L(x) F (x) U (x) x] 1

xj dFX (x)

j th sample moment
n

bj =

1X j
X
n i=1 i

Method of moments estimator (MoM)


1 () =
b1
2 () =
b2
.. ..
.=.
k () =
bk

11

Properties of the MoM estimator


bn exists with probability tending to 1
P
Consistency: bn
Asymptotic normality:

n(b ) N (0, )



where = gE Y Y T g T , Y = (X, X 2 , . . . , X k )T ,
1
g = (g1 , . . . , gk ) and gj =
j ()
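As an illustration of the method of moments (not from the cookbook itself), the sketch below matches the first two sample moments of simulated data to a Normal(μ, σ²) model, giving μ̂ = α̂₁ and σ̂² = α̂₂ − α̂₁²; the data and parameter values are made up:

```python
# Minimal sketch of the method of moments for a Normal(mu, sigma^2) model.
# Equating the first two model moments to the sample moments gives
#   mu_hat = alpha_hat_1,   sigma2_hat = alpha_hat_2 - alpha_hat_1^2.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)   # simulated data (assumed model)

alpha1 = np.mean(x)        # first sample moment
alpha2 = np.mean(x**2)     # second sample moment

mu_hat = alpha1
sigma2_hat = alpha2 - alpha1**2
print(mu_hat, sigma2_hat)  # close to 2 and 9
```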

12.2

Maximum Likelihood

Likelihood: Ln : [0, )
Ln () =

n
Y

f (Xi ; )

Equivariance: if θ̂ₙ is the mle of θ, then φ(θ̂ₙ) is the mle of φ(θ)


Asymptotic normality:
p
1. se 1/In ()
(bn ) D
N (0, 1)
se
q
b 1/In (bn )
2. se
(bn ) D
N (0, 1)
b
se
Asymptotic optimality (or efficiency), i.e., smallest variance for large samples. If en is any other estimator, the asymptotic relative efficiency is
h i
V bn
are(en , bn ) = h i 1
V en

i=1

Approximately the Bayes estimator


Log-likelihood
`n () = log Ln () =

n
X

log f (Xi ; )

i=1

12.2.1

Delta Method

b where is differentiable and 0 () 6= 0:


If = ()

Maximum likelihood estimator (mle)

(b
n ) D
N (0, 1)
b )
se(b

Ln (bn ) = sup Ln ()

b is the mle of and


where b = ()

Score function
s(X; ) =

log f (X; )



b b
b = 0 ()
b n )
se
se(

Fisher information
I() = V [s(X; )]
In () = nI()
Fisher information (exponential family)



I() = E s(X; )

Observed Fisher information


Inobs () =
Properties of the mle
P
Consistency: bn
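For a concrete, illustrative example (not part of the cookbook), the sketch below maximizes the Poisson log-likelihood numerically over a grid, compares it with the closed-form mle λ̂ = X̄ₙ, and computes the asymptotic standard error from the Fisher information I(λ) = 1/λ; the data and grid are made up:

```python
# Sketch: maximum likelihood for a Poisson(lambda) sample by direct
# maximization of the log-likelihood over a grid, compared with the
# closed-form mle (the sample mean) and the asymptotic standard error
# se = sqrt(lambda_hat / n) coming from I_n(lambda) = n / lambda.
import numpy as np

rng = np.random.default_rng(1)
x = rng.poisson(lam=4.0, size=500)
n = x.size

def loglik(lam):
    # log L(lambda) = sum_i [x_i log(lambda) - lambda - log(x_i!)]; the
    # log(x_i!) term is constant in lambda and can be dropped.
    return np.sum(x) * np.log(lam) - n * lam

grid = np.linspace(0.1, 10.0, 10_000)
lam_grid = grid[np.argmax([loglik(lam0) for lam0 in grid])]
lam_mle = x.mean()                      # closed-form mle
se_hat = np.sqrt(lam_mle / n)           # 1 / sqrt(I_n(lambda_hat))
print(lam_grid, lam_mle, se_hat)
```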

12.3

Multiparameter Models

Let = (1 , . . . , k ) and b = (b1 , . . . , bk ) be the mle.


Hjj =

2 `n
2

Hjk =

Fisher information matrix

n
2 X
log f (Xi ; )
2 i=1

2 `n
j k

..
.
E [Hk1 ]

E [H11 ]

..
In () =
.

E [H1k ]

..

.
E [Hkk ]

Under appropriate regularity conditions


(b ) N (0, Jn )

12

with Jn () = In1 . Further, if bj is the j th component of , then

13 Hypothesis Testing
H0 : 0

(bj j ) D
N (0, 1)
bj
se
h
i
b 2j = Jn (j, j) and Cov bj , bk = Jn (j, k)
where se

12.3.1

Multiparameter delta method

Let = (1 , . . . , k ) and let the gradient of be

1
.

=
..

k

Definitions
  Null hypothesis H₀, alternative hypothesis H₁
  Simple hypothesis: θ = θ₀
  Composite hypothesis: θ > θ₀ or θ < θ₀
  Two-sided test: H₀: θ = θ₀ versus H₁: θ ≠ θ₀
  One-sided test: H₀: θ ≤ θ₀ versus H₁: θ > θ₀
  Critical value c, test statistic T
  Rejection region R = {x : T(x) > c}
  Power function β(θ) = P_θ[X ∈ R]
  Power of a test: 1 − P[Type II error] = 1 − β = inf_{θ ∈ Θ₁} β(θ)

(b
) D
N (0, 1)
b )
se(b

H0 true
H1 true

T

Type II Error ()

1F (T (X))

p-value evidence scale:
  < 0.01       very strong evidence against H₀
  0.01 – 0.05  strong evidence against H₀
  0.05 – 0.1   weak evidence against H₀
  > 0.1        little or no evidence against H₀

 
b
Jbn


b and
b = b.
and Jbn = Jn ()
=

12.4

Retain H0

Reject H0
Type
I Error ()
(power)



p-value = sup0 P [T (X) T (x)] = inf : T (x) R


p-value = sup0
P [T (X ? ) T (X)]
= inf : T (X) R
|
{z
}

where
b ) =
se(b

Test size: = P [Type I error] = sup ()

p-value


b Then,
Suppose =b 6= 0 and b = ().

r


H1 : 1

versus

Parametric Bootstrap

Sample from f (x; bn ) instead of from Fbn , where bn could be the mle or method
of moments estimator.

since T (X ? )F


Wald test
Two-sided test
b 0
Reject H0 when |W | > z/2 where W =
b
se


P |W | > z/2
p-value = P0 [|W | > |w|] P [|Z| > |w|] = 2(|w|)
Likelihood ratio test (LRT)
T (X) =

sup Ln ()
Ln (bn )
=
sup0 Ln ()
Ln (bn,0 )

13

(X) = 2 log T (X) 2rq where

k
X

iid

Zi2 2k and Z1 , . . . , Zk N (0, 1)

 i=1

p-value = P0 [(X) > (x)] P 2rq > (x)
Multinomial LRT


Xk
X1
,...,
mle: pbn =
n
n
Xj
k 
Y
Ln (b
pn )
pbj
T (X) =
=
Ln (p0 )
p0j
j=1


k
X
pbj
D
Xj log
(X) = 2
2k1
p
0j
j=1

xn = (x1 , . . . , xn )
Prior density f ()
Likelihood f (xn | ): joint density of the data
n
Y
In particular, X n iid = f (xn | ) =
f (xi | ) = Ln ()
i=1

Posterior density f ( | xn )
R
Normalizing constant cn = f (xn ) = f (x | )f () d
Kernel: part of a density that depends Ron
R
Ln ()f ()
Posterior mean n = f ( | xn ) d = R
Ln ()f () d

14.1

The approximate size LRT rejects H0 when (X) 2k1,

Credible Intervals

Posterior interval

Pearson Chi-square Test

k
X
(Xj E [Xj ])2
where E [Xj ] = np0j under H0
T =
E [Xj ]
j=1
D

f ( | xn ) d = 1

P [ (a, b) | x ] =
a

Equal-tail credible interval

2k1

T


p-value = P 2k1 > T (x)

f ( | xn ) d =

2
The Pearson statistic converges in distribution to χ²_{k−1} faster than the LRT statistic, hence it is preferable for small n.

f ( | xn ) d = /2

Highest posterior density (HPD) region Rn

Independence testing
I rows, J columns, X multinomial sample of size n = I J
X
mles unconstrained: pbij = nij
X

mles under H0 : pb0ij = pbi pbj = Xni nj




PI PJ
nX
LRT: = 2 i=1 j=1 Xij log Xi Xijj
PI PJ (X E[X ])2
PearsonChiSq: T = i=1 j=1 ijE[Xij ]ij
D

LRT and Pearson 2k , where = (I 1)(J 1)

14

1. P [ Rn ] = 1
2. Rn = { : f ( | xn ) > k} for some k
Rn is unimodal = Rn is an interval

14.2

Function of parameters

Let = () and A = { : () }.
Posterior CDF for

14 Bayesian Inference

H(r | xn ) = P [() | xn ] =

f ( | xn ) d

Bayes Theorem

Posterior density

f (x | )f ()
f (x | )f ()
f ( | x) =
=R
Ln ()f ()
n
f (x )
f (x | )f () d

h( | xn ) = H 0 ( | xn )
Bayesian delta method

Definitions
n

X = (X1 , . . . , Xn )




b
b se
b 0 ()
| X n N (),

14

14.3

Priors

Continuous likelihood (subscript c denotes constant)


Likelihood

Conjugate prior

Unif (0, )

Pareto(xm , k)

Exp ()

Gamma (, )

Choice
Subjective Bayesianism.
Objective Bayesianism.
Robust Bayesianism.

i=1


2


2

N , c

N 0 , 0

Types
N c , 2

Flat: f () constant
R
Proper: f () d = 1
R
Improper: f () d =
Jeffreys prior (transformation-invariant):
f ()

I()

f ()

N , 2

Scaled Inverse Chisquare(, 02 )

Normalscaled
Inverse
Gamma(, , , )

det(I())
MVN(, c )

MVN(0 , 0 )

Conjugate: f () and f ( | xn ) belong to the same parametric family
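A minimal sketch of conjugacy in action (illustrative only; the hyperparameters and data are made up): with a Bernoulli(p) likelihood and a Beta(α, β) prior, the posterior is again Beta, as listed in the conjugate-prior table below.

```python
# Sketch: conjugacy for a Bernoulli(p) likelihood with a Beta(a, b) prior.
# The posterior is Beta(a + sum(x), b + n - sum(x)); the numbers below
# (a = b = 2 and the data) are made up for illustration.
import numpy as np

a, b = 2.0, 2.0                                # prior hyperparameters (assumed)
x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])   # observed Bernoulli data
n, s = x.size, x.sum()

a_post, b_post = a + s, b + n - s              # conjugate update
post_mean = a_post / (a_post + b_post)
print(a_post, b_post, post_mean)
```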


MVN(c , )
14.3.1

Conjugate Priors
Discrete likelihood

Likelihood
Bern (p)
Bin (p)

Conjugate prior
Beta (, )
Beta (, )

Posterior hyperparameters
+
+

n
X
i=1
n
X

xi , + n
xi , +

i=1

NBin (p)
Po ()

Beta (, )
Gamma (, )

+ rn, +
+

n
X

n
X

n
X

i=1

xi

Dir ()

n
X

xi , + n
x(i)

i=1

Geo (p)

Beta (, )

n
X

InverseWishart(, )

Pareto(xmc , k)

Gamma (, )

Pareto(xm , kc )

Pareto(x0 , k0 )

Gamma (c , )

Gamma (0 , 0 )

Pn
 

n
0
1
i=1 xi
+
+ 2 ,
/
2
2
02
c
 0
c1
n
1
+ 2
02
c
Pn
02 + i=1 (xi )2
+ n,
+n


n
+ n
x
,
+ n,
+ ,
+n
2
n
1X
(
x )2
2
+
(xi x
) +
2 i=1
2(n + )
1
1
0 + nc

+ n, +

n
X
i=1

1


1
1
x
,
0 0 + n


1 1
1
0 + nc
n
X
n + , +
(xi c )(xi c )T
i=1
n
X

xi
xm c
i=1
x0 , k0 kn where k0 > kn
n
X
0 + nc , 0 +
xi
+ n, +

log

i=1

xi

i=1

Ni

n
X

14.4
xi

Bayesian Testing

If H0 : 0 :

i=1

Z
Prior probability P [H0 ] =

i=1

i=1

Multinomial(p)

Posterior hyperparameters


max x(n) , xm , k + n
n
X
+ n, +
xi

Posterior probability P [H0 | xn ] =

f () d
Z0

f ( | xn ) d

Let H0 , . . . , HK1 be K hypotheses. Suppose f ( | Hk ),


xi

f (xn | Hk )P [Hk ]
P [Hk | xn ] = PK
,
n
k=1 f (x | Hk )P [Hk ]

15

Marginal likelihood
f (xn | Hi ) =

1. Estimate VF [Tn ] with VFbn [Tn ].


2. Approximate VFbn [Tn ] using simulation:

f (xn | , Hi )f ( | Hi ) d

(a) Repeat the following B times to get Tn,1


, . . . , Tn,B
, an iid sample from
b
the sampling distribution implied by Fn
i. Sample uniformly X , . . . , Xn Fbn .

Posterior odds (of Hi relative to Hj )


P [Hi | xn ]
P [Hj | xn ]

f (xn | Hi )
f (xn | Hj )
| {z }

Bayes Factor BFij

P [Hi ]
P [Hj ]
| {z }

ii. Compute Tn = g(X1 , . . . , Xn ).


(b) Then

prior odds

Bayes factor
  BF₁₀ = f(xⁿ | H₁) / f(xⁿ | H₀)
  Posterior probability of H₁:
    p* = ( (p/(1−p)) BF₁₀ ) / ( 1 + (p/(1−p)) BF₁₀ )

  log₁₀ BF₁₀   BF₁₀       evidence
  0 – 0.5      1 – 1.5    Weak
  0.5 – 1      1.5 – 10   Moderate
  1 – 2        10 – 100   Strong
  > 2          > 100      Decisive

vboot

B
B
X
1 X

bb = 1
T

=V
T
n,b
Fn
B
B r=1 n,r

!2

b=1

16.1.1

where p = P[H₁] and p* = P[H₁ | xⁿ]

Bootstrap Confidence Intervals

Normal-based interval
b boot
Tn z/2 se

15 Exponential Family

Pivotal interval

Scalar parameter

1.
2.
3.
4.

fX (x | ) = h(x) exp {()T (x) A()}


= h(x)g() exp {()T (x)}
Vector parameter
(
fX (x | ) = h(x) exp

s
X

Location parameter = T (F )
Pivot Rn = bn
Let H(r) = P [Rn r] be the cdf of Rn

Let Rn,b
= bn,b
bn . Approximate H using bootstrap:

B
1 X

b
H(r)
=
I(Rn,b
r)
B

i ()Ti (x) A()

i=1

b=1

= h(x) exp {() T (x) A()}


= h(x)g() exp {() T (x)}
Natural form
fX (x | ) = h(x) exp { T(x) A()}
= h(x)g() exp { T(x)}


= h(x)g() exp T T(x)

, . . . , bn,B
)
5. = sample quantile of (bn,1

6. r = sample quantile of (Rn,1


, . . . , Rn,B
), i.e., r = bn
 
7. Approximate 1 confidence interval Cn = a
, b where

a
=
b =

16 Sampling Methods

16.1 The Bootstrap

Let Tn = g(X1 , . . . , Xn ) be a statistic.



b 1 1 =
bn H
2
b 1 =
bn H
2

bn r1/2
=

2bn 1/2

bn r/2
=

2bn /2

Percentile interval



Cn = /2
, 1/2
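A minimal NumPy sketch of the bootstrap (not part of the cookbook; the data, the statistic, and B are chosen for illustration): it estimates the bootstrap standard error of the sample median and forms both the normal-based and the percentile interval.

```python
# Sketch of the nonparametric bootstrap for a statistic T_n (here the median):
# resample n points with replacement from F_hat_n, recompute T_n B times,
# then read off the bootstrap variance, the normal-based interval and the
# percentile interval.
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=100)    # observed sample (simulated here)
n, B, alpha = x.size, 2000, 0.05

t_hat = np.median(x)
t_boot = np.array([np.median(rng.choice(x, size=n, replace=True))
                   for _ in range(B)])

se_boot = t_boot.std(ddof=1)
z = 1.959963984540054                        # z_{alpha/2} for alpha = 0.05
normal_ci = (t_hat - z * se_boot, t_hat + z * se_boot)
percentile_ci = tuple(np.quantile(t_boot, [alpha / 2, 1 - alpha / 2]))
print(se_boot, normal_ci, percentile_ci)
```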
16

16.2

Rejection Sampling

Setup
We can easily sample from g()
We want to sample from h(), but it is difficult
k()
k() d
Envelope condition: we can find M > 0 such that k() M g()
We know h() up to a proportional constant: h() = R

Algorithm
1. Draw cand g()
2. Generate u Unif (0, 1)
k(cand )
3. Accept cand if u
M g(cand )
4. Repeat until B values of cand have been accepted
Example

Loss functions
Squared error loss: L(, a) = ( a)2
(
K1 ( a) a < 0
Linear loss: L(, a) =
K2 (a ) a 0
Absolute error loss: L(, a) = | a| (linear loss with K1 = K2 )
Lp loss: L(, a) = | a|p
(
0 a=
Zero-one loss: L(, a) =
1 a 6=

17.1

We can easily sample from the prior g() = f ()


Target is the posterior h() k() = f (xn | )f ()
Envelope condition: f (xn | ) f (xn | bn ) = Ln (bn ) M
Algorithm
1. Draw cand f ()
2. Generate u Unif (0, 1)
Ln (cand )
3. Accept cand if u
Ln (bn )
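A small illustrative sketch of rejection sampling (not from the cookbook): the target is a Beta(2, 5) density known only up to its kernel k(θ) = θ(1 − θ)⁴, the proposal g is Unif(0, 1), and M is chosen to bound k/g.

```python
# Sketch of rejection sampling: draw from an easy proposal g, accept with
# probability k(theta) / (M * g(theta)).  Target: Beta(2, 5), known only up
# to its kernel; proposal: Unif(0, 1), so g(theta) = 1.
import numpy as np

rng = np.random.default_rng(4)

def kernel(t):                      # unnormalized Beta(2, 5) density
    return t * (1 - t) ** 4

M = kernel(1 / 5) * 1.05            # kernel is maximized at t = 1/5; pad slightly

samples = []
while len(samples) < 5000:
    cand = rng.uniform()            # candidate from g = Unif(0, 1)
    u = rng.uniform()
    if u <= kernel(cand) / M:       # accept with prob k(cand) / (M g(cand))
        samples.append(cand)

samples = np.array(samples)
print(samples.mean())               # should be close to 2 / (2 + 5) = 0.286
```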

16.3

Decision rule: synonymous for an estimator b


Action a A: possible value of the decision rule. In the estimation
b
context, the action is just an estimate of , (x).
Loss function L: consequences of taking action a when true state is or
b L : A [k, ).
discrepancy between and ,

Posterior risk
Z

h
i
b
b
L(, (x))f
( | x) d = E|X L(, (x))

h
i
b
b
L(, (x))f
(x | ) dx = EX| L(, (X))

r(b | x) =
(Frequentist) risk
b =
R(, )
Bayes risk
ZZ

Importance Sampling

b =
r(f, )

Sample from an importance function g rather than target density h.


Algorithm to obtain an approximation to E [q() | xn ]:

17 Decision Theory

Definitions
Unknown quantity affecting our decision:

h
i
b
b
L(, (x))f
(x, ) dx d = E,X L(, (X))

h
h
ii
h
i
b = E EX| L(, (X)
b
b
r(f, )
= E R(, )
h
h
ii
h
i
b = EX E|X L(, (X)
b
r(f, )
= EX r(b | X)

iid

1. Sample from the prior 1 , . . . , n f ()


Ln (i )
2. wi = PB
i = 1, . . . , B
i=1 Ln (i )
PB
3. E [q() | xn ] i=1 q(i )wi
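A minimal sketch of self-normalized importance sampling for a posterior mean (illustrative only; the Normal likelihood, the N(0, 10²) prior, and the data are assumptions, not content of the cookbook):

```python
# Sketch of importance sampling for E[q(theta) | x^n]: draw theta_i from the
# prior, weight by the likelihood, normalize the weights, and average.
# Model: x_i ~ N(theta, 1) with a N(0, 10^2) prior; q(theta) = theta.
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=1.5, scale=1.0, size=50)   # data (simulated; true theta = 1.5)
B = 20_000

theta = rng.normal(loc=0.0, scale=10.0, size=B)   # draws from the prior
# log L_n(theta_i); additive constants cancel after normalization
log_w = np.array([np.sum(-0.5 * (x - t) ** 2) for t in theta])
w = np.exp(log_w - log_w.max())                   # stabilize before normalizing
w /= w.sum()

post_mean = np.sum(theta * w)          # approximates E[theta | x^n]
print(post_mean)
```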

17

Risk

17.2

Admissibility

b0 dominates b if

b
: R(, b0 ) R(, )
b
: R(, b0 ) < R(, )

b is inadmissible if there is at least one other estimator b0 that dominates


it. Otherwise it is called admissible.

17

17.3

Bayes Rule

Residual sums of squares (rss)

Bayes rule (or Bayes estimator)


rss(b0 , b1 ) =

b = inf e r(f, )
e
r(f, )

R
b
b
b = r(b | x)f (x) dx
(x) = inf r( | x) x = r(f, )

n
X

2i

i=1

Least square estimates


Theorems

bT = (b0 , b1 )T : min rss


b0 ,
b1

Squared error loss: posterior mean


Absolute error loss: posterior median
Zero-one loss: posterior mode

17.4

n
b0 = Yn b1 X
Pn
Pn
n )(Yi Yn )

(Xi X
i=1 Xi Yi nXY
Pn
b1 = i=1
=
P
n
2
2
2
i=1 (Xi Xn )
i=1 Xi nX
h
i  
0
E b | X n =
1


P
h
i
2 n1 ni=1 Xi2 X n
n
b
V |X =
X n
1
nsX
r Pn
2

b
i=1 Xi

b b0 ) =
se(
n
sX n

b b1 ) =
se(
sX n

Minimax Rules

Maximum risk
b = sup R(, )
b
)
R(

R(a)
= sup R(, a)

Minimax rule
e
e = inf sup R(, )
b = inf R(
)
sup R(, )

b =c
b = Bayes rule c : R(, )
Least favorable prior
bf = Bayes rule R(, bf ) r(f, bf )

18 Linear Regression

Pn
b2 =
where s2X = n1 i=1 (Xi X n )2 and
Further properties:

Pn

2i
i=1 

(unbiased estimate).

P
P
Consistency: b0 0 and b1 1
Asymptotic normality:

Definitions
Response variable Y
Covariate X (aka predictor variable or feature)

18.1

1
n2

b0 0 D
N (0, 1)
b b0 )
se(

Simple Linear Regression

b1 1 D
N (0, 1)
b b1 )
se(

Approximate 1 confidence intervals for 0 and 1 :

Model
Yi = 0 + 1 Xi + i

and

E [i | Xi ] = 0, V [i | Xi ] = 2

b b0 )
b0 z/2 se(

b b1 )
and b1 z/2 se(

Fitted line
Wald test for H0 : 1 = 0 vs. H1 : 1 6= 0: reject H0 if |W | > z/2 where
b b1 ).
W = b1 /se(

rb(x) = b0 + b1 x
Predicted (fitted) values
Ybi = rb(Xi )
Residuals

i = Yi Ybi = Yi b0 + b1 Xi

R2


Pn b
Pn 2
2

rss
i=1 (Yi Y )
R = Pn
= 1 Pn i=1 i 2 = 1
2
tss
i=1 (Yi Y )
i=1 (Yi Y )
2
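The least squares formulas above translate directly into code; the following NumPy sketch (illustrative, with simulated data and made-up coefficients) computes β̂₀, β̂₁, their standard errors, and R²:

```python
# Sketch of simple linear regression by least squares: compute beta_hat_0,
# beta_hat_1, sigma_hat^2, their standard errors, and R^2 with plain NumPy.
import numpy as np

rng = np.random.default_rng(6)
n = 200
X = rng.uniform(0, 10, size=n)
Y = 1.0 + 2.5 * X + rng.normal(scale=2.0, size=n)   # simulated (beta0=1, beta1=2.5)

xbar, ybar = X.mean(), Y.mean()
b1 = np.sum((X - xbar) * (Y - ybar)) / np.sum((X - xbar) ** 2)
b0 = ybar - b1 * xbar

resid = Y - (b0 + b1 * X)
sigma2_hat = np.sum(resid ** 2) / (n - 2)            # unbiased estimate
s_x = np.sqrt(np.mean((X - xbar) ** 2))
se_b1 = np.sqrt(sigma2_hat) / (s_x * np.sqrt(n))
se_b0 = se_b1 * np.sqrt(np.mean(X ** 2))
r2 = 1 - np.sum(resid ** 2) / np.sum((Y - ybar) ** 2)
print(b0, b1, se_b0, se_b1, r2)
```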

18

If the (k k) matrix X T X is invertible,

Likelihood
L=
L1 =

n
Y
i=1
n
Y

f (Xi , Yi ) =

n
Y

fX (Xi )

i=1

n
Y

b = (X T X)1 X T Y
h
i
V b | X n = 2 (X T X)1

fY |X (Yi | Xi ) = L1 L2

i=1

b N , 2 (X T X)1

fX (Xi )

Estimate regression function

i=1
n
Y

2
1 X
Yi (0 1 Xi )
fY |X (Yi | Xi ) n exp 2
L2 =
2 i
i=1

)
rb(x) =

k
X

bj xj

j=1

Under the assumption of Normality, the least squares estimator is also the mle

Unbiased estimate for

n
1 X 2


b =
n k i=1 i

1X 2


b2 =
n i=1 i

 = X b Y

mle

18.2

b=X

Prediction

Observe X = x of the covariate and want to predict their outcome Y .


Yb = b0 + b1 x
i
h i
h i
h
i
V Yb = V b0 + x2 V b1 + 2x Cov b0 , b1
h

Prediction interval
bn2 =
b2


 Pn
2
i=1 (Xi X )
P
2j + 1
n i (Xi X)

nk 2

1 Confidence interval
b bj )
bj z/2 se(

18.4

Model Selection

Consider predicting a new observation Y for covariates X and let S J


denote a subset of the covariates in the model, where |S| = k and |J| = n.
Issues
Underfitting: too few covariates yields high bias
Overfitting: too many covariates yields high variance

Yb z/2 bn

18.3

b2 =

Procedure
1. Assign a score to each model
2. Search through all models to find the one with the highest score

Multiple Regression
Y = X + 

Hypothesis testing
H0 : j = 0 vs. H1 : j 6= 0 j J

where

X11
..
X= .
Xn1

..
.

X1k
..
.
Xnk

1

= ...


1
..
=.

n

Mean squared prediction error (mspe)


h
i
mspe = E (Yb (S) Y )2
Prediction risk

Likelihood

1
2 n/2
L(, ) = (2 )
exp 2 rss
2


R(S) =

n
X

mspei =

i=1

n
X

h
i
E (Ybi (S) Yi )2

i=1

Training error
rss = (y X)T (y X) = kY Xk2 =

N
X
(Yi xTi )2
i=1

btr (S) =
R

n
X
(Ybi (S) Yi )2
i=1

19

R2
btr (S)
rss(S)
R
R2 (S) = 1
=1
=1
tss
tss

Pn b
2
i=1 (Yi (S) Y )
P
n
2
i=1 (Yi Y )

Frequentist risk
Z
h
i Z
R(f, fbn ) = E L(f, fbn ) = b2 (x) dx + v(x) dx

The training error is a downward-biased estimate of the prediction risk.


h
i
btr (S) < R(S)
E R
h
i
btr (S)) = E R
btr (S) R(S) = 2
bias(R

n
X

h
i
b(x) = E fbn (x) f (x)
h
i
v(x) = V fbn (x)

h
i
Cov Ybi , Yi

i=1

Adjusted R

R2 (S) = 1

n 1 rss
n k tss

19.1.1

Mallows' C_p statistic

Histograms

Definitions

b
btr (S) + 2kb
R(S)
=R
2 = lack of fit + complexity penalty

Akaike Information Criterion (AIC)


AIC(S) = `n (bS ,
bS2 ) k
Bayesian Information Criterion (BIC)

Number of bins m
1
Binwidth h = m
Bin Bj has j observations
R
Define pbj = j /n and pj = Bj f (u) du

Histogram estimator

k
BIC(S) = `n (bS ,
bS2 ) log n
2
Validation and training
bV (S) =
R

m
X

fbn (x) =

(Ybi (S) Yi )2

m = |{validation data}|, often

i=1

Leave-one-out cross-validation
bCV (S) =
R

n
X

(Yi Yb(i) ) =

i=1

n
X
i=1

Yi Ybi (S)
1 Uii (S)

!2

U(S) = X_S (X_Sᵀ X_S)^{−1} X_Sᵀ   (hat matrix)

19
19 Non-parametric Function Estimation

n
n
or
4
2

m
X
pbj
j=1

I(x Bj )

h
i p
j
E fbn (x) =
h
h
i p (1 p )
j
j
V fbn (x) =
nh2
Z
1
h2
2
b
(f 0 (u)) du +
R(fn , f )
12
nh
!1/3
1
6
h = 1/3 R
2 du
n
(f 0 (u))
 2/3 Z
1/3
3
C
2
C=
(f 0 (u)) du
R (fbn , f ) 2/3
4
n

Density Estimation

R
Estimate f (x), where f (x) = P [X A] = A f (x) dx.
Integrated square error (ise)
Z 
Z
2
L(f, fbn ) =
f (x) fbn (x) dx = J(h) + f 2 (x) dx

Cross-validation estimate of E [J(h)]


Z
JbCV (h) =

2Xb
2
n+1 X 2
f(i) (Xi ) =
pb
fbn2 (x) dx

n i=1
(n 1)h (n 1)h j=1 j
20

19.1.2

Kernel Density Estimator (KDE)

k-nearest Neighbor Estimator


X
1
Yi
where Nk (x) = {k values of x1 , . . . , xn closest to x}
rb(x) =
k

Kernel K

i:xi Nk (x)

K(x) 0
R
K(x) dx = 1
R
xK(x) dx = 0
R 2
2
>0
x K(x) dx K

Nadaraya-Watson Kernel Estimator


rb(x) =

n
X

wi (x)Yi

i=1

KDE

xxi

h
Pn
xxj
K
j=1
h

wi (x) =



n
1X1
x Xi
fbn (x) =
K
n i=1 h
h
Z
Z
1
1
4
00
2
b
R(f, fn ) (hK )
(f (x)) dx +
K 2 (x) dx
4
nh
Z
Z
2/5 1/5 1/5
c
c2 c3
2
2
h = 1
c
=

,
c
=
K
(x)
dx,
c
=
(f 00 (x))2 dx
1
2
3
K
n1/5
Z
4/5 Z
1/5
c4
5 2 2/5

2
00 2
b
R (f, fn ) = 4/5
K (x) dx
(f ) dx
c4 = (K )
4
n
|
{z
}

R(b
rn , r)

h4
4
Z

[0, 1]

4 Z 
2
f 0 (x)
x2 K 2 (x) dx
r00 (x) + 2r0 (x)
dx
f (x)
R
2 K 2 (x) dx
dx
nhf (x)
Z

c1
n1/5
c2
R (b
rn , r) 4/5
n
h

C(K)

Cross-validation estimate of E [J(h)]

Epanechnikov Kernel
(
K(x) =

4 5(1x2 /5)

|x| <

otherwise
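A minimal NumPy sketch of a kernel density estimator with a Gaussian kernel (illustrative only; the normal-reference bandwidth h = 1.06 σ̂ n^(−1/5) is an assumption, since the optimal h* above depends on the unknown f''):

```python
# Sketch of the KDE  f_hat(x) = (1/n) sum_i (1/h) K((x - X_i) / h)
# with a standard Gaussian kernel K and a rule-of-thumb bandwidth.
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(size=500)
n = x.size
h = 1.06 * x.std(ddof=1) * n ** (-1 / 5)    # assumed bandwidth rule

def K(u):                                   # standard Gaussian kernel
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def f_hat(t):
    t = np.atleast_1d(t)
    return np.array([np.mean(K((ti - x) / h)) / h for ti in t])

print(f_hat([-1.0, 0.0, 1.0]))              # roughly the N(0, 1) density there
```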

JbCV (h) =

n
X

(Yi rb(i) (xi ))2 =

i=1

n
X
i=1

(Yi rb(xi ))2


1

Pn

j=1

Cross-validation estimate of E [J(h)]


Z
JbCV (h) =

K (x) = K (2) (x) 2K(x)

19.2

19.3



n
n
n
2Xb
1 X X Xi Xj
2
2
b
fn (x) dx
f(i) (Xi )
+
K(0)
K
n i=1
hn2 i=1 j=1
h
nh
K (2) (x) =

Approximation
r(x) =

X
j=1

j j (x)

J
X

j j (x)

i=1

Multivariate regression

Non-parametric Regression

Estimate f (x) where f (x) = E [Y | X = x]. Consider pairs of points


(x1 , Y1 ), . . . , (xn , Yn ) related by

K(0)
 xx 
j
K
h

Smoothing Using Orthogonal Functions

Z
K(x y)K(y) dy

!2

where

i = i

Y = +

0 (x1 )
..
and = .

..
.
0 (xn )

J (x1 )
..
.
J (xn )

Least squares estimator


Yi = r(xi ) + i
E [i ] = 0
V [i ] = 2

b = (T )1 T Y
1
T Y (for equally spaced observations only)
n

21

Cross-validation estimate of E [J(h)]

2
n
J
X
X
bCV (J) =
Yi
R
j (xi )bj,(i)
i=1

20

j=1

20.2

Poisson Processes

Poisson process
{Xt : t [0, )} = number of events up to and including time t
X0 = 0
Independent increments:

20 Stochastic Processes

Stochastic Process
(
{0, 1, . . . } = Z discrete
T =
[0, )
continuous

{Xt : t T }
Notations Xt , X(t)
State space X
Index set T

20.1

t0 < < tn : Xt1 Xt0



Xtn Xtn1
Intensity function (t)
P [Xt+h Xt = 1] = (t)h + o(h)
P [Xt+h Xt = 2] = o(h)
Xs+t Xs Po (m(s + t) m(s)) where m(t) =

Markov Chains

Markov chain

Rt
0

(s) ds

Homogeneous Poisson process

P [Xn = x | X0 , . . . , Xn1 ] = P [Xn = x | Xn1 ]

n T, x X
(t) = Xt Po (t)

Transition probabilities
pij P [Xn+1 = j | Xn = i]
pij (n) P [Xm+n = j | Xm = i]

>0

Waiting times
n-step
Wt := time at which Xt occurs

Transition matrix P (n-step: Pn )


(i, j) element is pij
pij > 0
P

i pij = 1



1
Wt Gamma t,

Chapman-Kolmogorov

Interarrival times

pij (m + n) =

pij (m)pkj (n)

St = Wt+1 Wt

Pm+n = Pm Pn
Pn = P P = Pn

St Exp

 
1

Marginal probability
n = (n (1), . . . , n (N ))

where

i (i) = P [Xn = i]

St

0 , initial distribution
n = 0 Pn
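As an illustration of μₙ = μ₀Pⁿ (not from the cookbook; the two-state transition matrix is made up), here is a small NumPy sketch that computes the n-step marginal distribution and simulates a short path:

```python
# Sketch: n-step distribution of a finite Markov chain, mu_n = mu_0 P^n,
# together with a simulated path.  The 2-state transition matrix is made up.
import numpy as np

rng = np.random.default_rng(9)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])                  # transition matrix (rows sum to 1)
mu0 = np.array([1.0, 0.0])                  # initial distribution

mu_n = mu0 @ np.linalg.matrix_power(P, 50)  # marginal after 50 steps
print(mu_n)                                 # close to the stationary distribution

state, path = 0, [0]
for _ in range(20):                         # simulate a short path
    state = rng.choice(2, p=P[state])
    path.append(state)
print(path)
```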

[Diagram: waiting times W_{t−1}, W_t and the interarrival time S_t on the time axis.]
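A small illustrative sketch of a homogeneous Poisson process (not part of the cookbook; λ and T are made up): interarrival times are drawn as Exp(1/λ), i.e. with mean 1/λ, and accumulated, so the count on [0, T] is Po(λT).

```python
# Sketch: simulate a homogeneous Poisson process with rate lambda on [0, T]
# by accumulating exponential interarrival times with mean 1 / lambda.
import numpy as np

rng = np.random.default_rng(10)
lam, T = 2.0, 10.0

arrivals = []
t = rng.exponential(scale=1.0 / lam)        # scale = mean interarrival time
while t <= T:
    arrivals.append(t)
    t += rng.exponential(scale=1.0 / lam)

print(len(arrivals))                        # on average lambda * T = 20 events
```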
22

21 Time Series

21.1

Mean function

Strictly stationary

xt = E [xt ] =

Stationary Time Series

xft (x) dx

P [xt1 c1 , . . . , xtk ck ] = P [xt1 +h c1 , . . . , xtk +h ck ]

Autocovariance function
x (s, t) = E [(xs s )(xt t )] = E [xs xt ] s t


x (t, t) = E (xt t )2 = V [xt ]
Autocorrelation function (ACF)

k N, tk , ck , h Z
Weakly stationary
 
t Z
E x2t <
 2
E xt = m
t Z
x (s, t) = x (s + r, t + r)

(s, t)
Cov [xs , xt ]
=p
(s, t) = p
V [xs ] V [xt ]
(s, s)(t, t)
Cross-covariance function (CCV)

r, s, t Z

Autocovariance function

xy (s, t) = E [(xs xs )(yt yt )]

Cross-correlation function (CCF)


xy (s, t) = p

xy (s, t)
x (s, s)y (t, t)

(h) = E [(xt+h )(xt )]




(0) = E (xt )2
(0) 0
(0) |(h)|
(h) = (h)

h Z

Backshift operator
B k (xt ) = xtk

Autocorrelation function (ACF)

Difference operator

(t + h, t)
(h)
Cov [xt+h , xt ]
x (h) = p
=p
=
(0)
V [xt+h ] V [xt ]
(t + h, t + h)(t, t)

d = (1 B)d
White noise
2
wt wn(0, w
)

Jointly stationary time series

iid

2
0, w

Gaussian: wt N
E [wt ] = 0 t T
V [wt ] = 2 t T
w (s, t) = 0 s 6= t s, t T

xy (h) = E [(xt+h x )(yt y )]


xy (h) = p

Random walk
Linear process

Drift
Pt
xt = t + j=1 wj
E [xt ] = t

xt = +

k
X
j=k

aj xtj

j wtj

where

j=

Symmetric moving average


mt =

xy (h)
x (0)y (h)

where aj = aj 0 and

k
X
j=k

aj = 1

(h) =

|j | <

j=

2
w

j+h j

j=

23

21.2

Estimation of Correlation

21.3.1

Detrending

Least squares

Sample mean
n

1X
x
=
xt
n t=1
Sample variance

n 
|h|
1 X
1
x (h)
V [
x] =
n
n
h=n

1. Choose trend model, e.g., t = 0 + 1 t + 2 t2


2. Minimize rss to obtain trend estimate
bt = b0 + b1 t + b2 t2
3. Residuals , noise wt
Moving average
The low-pass filter vt is a symmetric moving average mt with aj =

Sample autocovariance function


nh
1 X

b(h) =
(xt+h x
)(xt x
)
n t=1

vt =

1
2k+1 :

k
X
1
xt1
2k + 1
i=k

Pk
1
If 2k+1
i=k wtj 0, a linear trend function t = 0 + 1 t passes
without distortion

Sample autocorrelation function


b(h) =

b(h)

b(0)
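The sample autocovariance and autocorrelation functions above can be computed directly; the following sketch (illustrative, on simulated white noise) mirrors the definitions:

```python
# Sketch: sample autocovariance and autocorrelation functions
#   gamma_hat(h) = (1/n) sum_{t=1}^{n-h} (x_{t+h} - xbar)(x_t - xbar),
#   rho_hat(h)   = gamma_hat(h) / gamma_hat(0).
import numpy as np

rng = np.random.default_rng(11)
x = rng.normal(size=500)                    # white noise series
n, xbar = x.size, x.mean()

def gamma_hat(h):
    return np.sum((x[h:] - xbar) * (x[:n - h] - xbar)) / n

rho_hat = np.array([gamma_hat(h) for h in range(11)]) / gamma_hat(0)
print(rho_hat)      # lag 0 is 1; other lags are near 0 (about +/- 1/sqrt(n))
```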

Differencing
t = 0 + 1 t = xt = 1

Sample cross-variance function

bxy (h) =

nh
1 X
(xt+h x
)(yt y)
n t=1

21.4

ARIMA models

Autoregressive polynomial
(z) = 1 1 z p zp

Sample cross-correlation function

bxy (h)
bxy (h) = p

bx (0)b
y (0)

z C p 6= 0

Autoregressive operator
(B) = 1 1 B p B p

Properties
1
bx (h) = if xt is white noise
n
1
bxy (h) = if xt or yt is white noise
n

21.3

Non-Stationary Time Series

Autoregressive model order p, AR (p)


xt = 1 xt1 + + p xtp + wt (B)xt = wt
AR (1)
xt = k (xtk ) +

k1
X

j (wtj )

k,||<1

j=0

Classical decomposition model

j (wtj )

j=0

{z

linear process

xt = t + st + wt
t = trend
st = seasonal component
wt = random noise term

E [xt ] =

j=0

j (E [wtj ]) = 0

(h) = Cov [xt+h , xt ] =


(h) =

(h)
(0)

2 h
w

12

= h

(h) = (h 1) h = 1, 2, . . .

24

Moving average polynomial

Seasonal ARIMA

(z) = 1 + 1 z + + q zq

z C q 6= 0

Moving average operator


(B) = 1 + 1 B + + p B p

21.4.1

MA (q) (moving average model order q)


xt = wt + 1 wt1 + + q wtq xt = (B)wt
E [xt ] =

q
X

Denoted by ARIMA (p, d, q) (P, D, Q)s


d
s
P (B s )(B)D
s xt = + Q (B )(B)wt
Causality and Invertibility

ARMA (p, q) is causal (future-independent) {j } :


xt =

j E [wtj ] = 0

(h) = Cov [xt+h , xt ] =

2
w
0

Pqh
j=0

j j+h

j=0

j < such that

wtj = (B)wt

j=0

j=0

0hq
h>q

ARMA (p, q) is invertible {j } :

MA (1)
xt = wt + wt1

2 2

(1 + )w h = 0
2
(h) = w
h=1

0
h>1
(

h=1
2
(h) = (1+ )
0
h>1

(B)xt =

j=0

j < such that

Xtj = wt

j=0

Properties
ARMA (p, q) causal roots of (z) lie outside the unit circle
(z) =

j z j =

j=0

(z)
(z)

|z| 1

ARMA (p, q)
xt = 1 xt1 + + p xtp + wt + 1 wt1 + + q wtq

ARMA (p, q) invertible roots of (z) lie outside the unit circle

(B)xt = (B)wt

(z) =

Partial autocorrelation function (PACF)

j z j =

j=0

xh1
, regression of xi on {xh1 , xh2 , . . . , x1 }
i
hh = corr(xh xh1
, x0 xh1
) h2
0
h
E.g., φ₁₁ = corr(x₁, x₀) = ρ(1)

(z)
(z)

|z| 1

Behavior of the ACF and PACF for causal and invertible ARMA models:
              ACF                     PACF
  AR(p)       tails off               cuts off after lag p
  MA(q)       cuts off after lag q    tails off
  ARMA(p, q)  tails off               tails off

ARIMA(p, d, q): ∇ᵈxₜ = (1 − B)ᵈxₜ is ARMA(p, q)

φ(B)(1 − B)ᵈxₜ = θ(B)wₜ


Exponentially Weighted Moving Average (EWMA)
xt = xt1 + wt wt1
xt =

(1 )j1 xtj + wt

when || < 1

21.5

Spectral Analysis

Periodic process
xt = A cos(2t + )
= U1 cos(2t) + U2 sin(2t)

j=1

x
n+1 = (1 )xn +
xn
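A minimal sketch of EWMA forecasting with the recursion above (illustrative only; the series and λ = 0.8 are made up):

```python
# Sketch of exponentially weighted moving average forecasting,
#   x_tilde_{n+1} = (1 - lam) * x_n + lam * x_tilde_n,  |lam| < 1,
# applied recursively along an (assumed) observed series.
import numpy as np

rng = np.random.default_rng(12)
x = np.cumsum(rng.normal(size=200))         # a random-walk-like series
lam = 0.8

forecast = np.empty_like(x)
forecast[0] = x[0]                          # initialize the recursion
for t in range(1, x.size):
    forecast[t] = (1 - lam) * x[t - 1] + lam * forecast[t - 1]

print(forecast[-5:], x[-5:])
```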

Frequency index (cycles per unit time), period 1/

25

Amplitude A
Phase
U1 = A cos and U2 = A sin often normally distributed rvs

Discrete Fourier Transform (DFT)


d(j ) = n1/2

xt =

q
X

Fourier/Fundamental frequencies
(Uk1 cos(2k t) + Uk2 sin(2k t))

j = j/n

k=1

Uk1 , Uk2 , for k = 1, . . . , q, are independent zero-mean rvs with variances k2


Pq
(h) = k=1 k2 cos(2k h)
  Pq
(0) = E x2t = k=1 k2

Inverse DFT
xt = n1/2

j=0

Scaled Periodogram

2 2i0 h 2 2i0 h
e
+
e
=
2
2
Z 1/2
=
e2ih dF ()

4
I(j/n)
n
!2
n
2X
=
xt cos(2tj/n +
n t=1

P (j/n) =

1/2

Spectral distribution function


< 0
< 0
0

22
22.1

(h)e2ih

h=
h=

!2

Gamma Function

ts1 et dt
0
Z
Upper incomplete: (s, x) =
ts1 et dt
x
Z x
Lower incomplete: (s, x) =
ts1 et dt

Ordinary: (s) =

Spectral density

2X
xt sin(2tj/n
n t=1

22 Math
Z

F () = F (1/2) = 0
F () = F (1/2) = (0)

f () =

d(j )e2ij t

I(j/n) = |d(j/n)|2

(h) = 2 cos(20 h)

0
F () = 2 /2

n1
X

Periodogram

Spectral representation of a periodic process

xt e2ij t

i=1

Periodic mixture

Needs

n
X

|(h)| < = (h) =

1
1

2
2

R 1/2
1/2

e2ih f () d

f () 0
f () = f ()
f () = f (1 )
R 1/2
(0) = V [xt ] = 1/2 f () d

h = 0, 1, . . .

( + 1) = ()
>1
(n) = (n 1)!
nN

(1/2) =

22.2

Beta Function

Z 1
(x)(y)
tx1 (1 t)y1 dt =
Ordinary: B(x, y) = B(y, x) =
(x + y)
0
Z x
Incomplete: B(x; a, b) =
ta1 (1 t)b1 dt

2
White noise: fw () = w
ARMA (p, q) , (B)xt = (B)wt :

|(e2i )|2
fx () =
|(e2i )|2
Pp
Pq
where (z) = 1 k=1 k z k and (z) = 1 + k=1 k z k
2
w

Regularized incomplete:
a+b1
B(x; a, b) a,bN X
(a + b 1)!
Ix (a, b) =
=
xj (1 x)a+b1j
B(a, b)
j!(a
+
b

j)!
j=a

26

Stirling numbers, 2nd kind


 

 

n
n1
n1
=k
+
k
k
k1

I0 (a, b) = 0
I1 (a, b) = 1
Ix (a, b) = 1 I1x (b, a)

22.3

Series

Finite

n(n + 1)
k=
2

(2k 1) = n2

k=1
n
X

k=1
n
X

k=1
n
X

k2 =

ck =

k=0

cn+1 1
c1

n  
X
n
k=0
n 
X

=2

k=0

Balls and Urns


|B| = n, |U | = m
B : D, U : D

Binomial
Theorem:
n  
X
n nk k
a
b = (a + b)n
k

B : D, U : D

k=0

c 6= 1

k > n : Pn,k = 0

f :BU
f arbitrary
n

m


B : D, U : D


n+n1
n
m  
X
n
k=1

1
,
1p

p
|p| < 1
1p
k=1
!



X
d
1
1
k
=
p
=
dp 1 p
(1 p)2
pk =

kpk1 =

B : D, U : D
|p| < 1

ordered

(n i) =

i=0

unordered

Pn,k

f injective
(
mn m n
0
else
 
m
n
(
1 mn
0 else
(
1 mn
0 else

f surjective
 
n
m!
m



n1
m1
 
n
m
Pn,m

f bijective
(
n! m = n
0 else
(
1 m=n
0 else
(
1 m=n
0 else
(
1 m=n
0 else

References

[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications: With R Examples. Springer, 2006.

w/o replacement
nk =

D = distinguishable, D̄ = indistinguishable.

[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45–53, 2008.

Sampling

k1
Y

n 1 : Pn,0 = 0, P0,0 = 1

[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole, 1972.

Combinatorics

k out of n

m
X
k=1

k=0

22.4

Pn,i

k=0

d
dp
k=0
k=0


X
r+k1 k

x = (1 x)r r N+
k
k=0
 
X
k

p = (1 + p) |p| < 1 , C
k

n
X

Vandermonde's Identity:
 

r  
X
m
n
m+n
=
k
rk
r

k=0

Infinite

pk =

Pn+k,k =

i=1

 

r+k
r+n+1
=
k
n
k=0




n
X k
n+1

=
m
m+1

n(n + 1)(2n + 1)
6
k=1
2

n
X
n(n + 1)

k3 =
2

Partitions

Binomial

n
X

1kn

  (
1 n=0
n
=
0
0 else

n!
(n k)!

 
n
nk
n!
=
=
k
k!
k!(n k)!

w/ replacement

[4] A. Steger. Diskrete Strukturen Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.

nk

[5] A. Steger. Diskrete Strukturen Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.


 

n1+r
n1+r
=
r
n1

[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.
27

28

Univariate distribution relationships, courtesy Leemis and McQueston [2].
