
36-724 Old Faithful E-M Example

The Problem
> library(MASS)
> data(faithful)
> attach(faithful)
defines a vector waiting of n = 272 waiting times between eruptions of the geyser Old Faithful
at Yellowstone Park. [A related data set, geyser, is discussed by Venables and Ripley (MASS, 3rd
ed., p. 252; 4th ed., p. 127).]
A histogram of the data in waiting looks like this:
[Figure: Histogram of waiting; x-axis: waiting (40 to 100), y-axis: Frequency (0 to 50).]
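The figure can be reproduced with a single base-R call (a minimal sketch; bin boundaries may differ slightly from the figure above depending on the R version):

> hist(waiting)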
The histogram suggests that a plausible model for the waiting times $y_1, \ldots, y_n$ is a mixture of two normals:
$$
y_i \mid \mu_1, \mu_2, \sigma_1, \sigma_2, q \;\overset{\text{iid}}{\sim}\;
\left\{ q\,\frac{1}{\sqrt{2\pi}\,\sigma_1}\exp\!\left[-(y_i-\mu_1)^2/2\sigma_1^2\right]
\;+\; (1-q)\,\frac{1}{\sqrt{2\pi}\,\sigma_2}\exp\!\left[-(y_i-\mu_2)^2/2\sigma_2^2\right] \right\}
\qquad (*)
$$
We wish to estimate the parameters $\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$ and $q$.
Letting $f_1(y_i \mid \mu_1, \sigma_1) = \frac{1}{\sqrt{2\pi}\,\sigma_1}\exp[-(y_i-\mu_1)^2/2\sigma_1^2]$ and $f_2(y_i \mid \mu_2, \sigma_2) = \frac{1}{\sqrt{2\pi}\,\sigma_2}\exp[-(y_i-\mu_2)^2/2\sigma_2^2]$,
it is clear that the observed data likelihood is
$$
g(y \mid \mu_1, \sigma_1, \mu_2, \sigma_2, q) \;=\; \prod_{i=1}^{n}\left\{ q\, f_1(y_i \mid \mu_1, \sigma_1) + (1-q)\, f_2(y_i \mid \mu_2, \sigma_2) \right\}.
$$
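For reference, this observed-data log-likelihood is easy to evaluate directly in R; the sketch below is essentially the llg function defined in the Implementation section, just written with dnorm (the name loglik.obs is illustrative, not from the original code):

library(MASS)
data(faithful)
loglik.obs <- function(y, mu1, sigma1, mu2, sigma2, q) {  # log of g(y | mu1, sigma1, mu2, sigma2, q)
  sum(log(q*dnorm(y, mu1, sigma1) + (1-q)*dnorm(y, mu2, sigma2)))
}
loglik.obs(faithful$waiting, 55, 5, 80, 5, 0.5)  # e.g., at the eyeballed starting values used below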
Data augmentation
This likelihood is greatly simplified by using data augmentation (why it is simpler will only become
clear later). Consider the latent variable (a.k.a. missing data, or augmentation data)
$$
z_i = \begin{cases} 1 & \text{if } y_i \text{ belongs to the left hump} \\ 0 & \text{if } y_i \text{ belongs to the right hump} \end{cases}
\qquad i = 1, \ldots, n.
$$
If we assume
$$
z_i \mid q \;\overset{\text{iid}}{\sim}\; \mathrm{Bin}(1, q)
$$
$$
y_i \mid \mu_1, \mu_2, \sigma_1, \sigma_2, z_i \;\overset{\text{indep}}{\sim}\;
\begin{cases} N(\mu_1, \sigma_1^2), & \text{if } z_i = 1 \\ N(\mu_2, \sigma_2^2), & \text{if } z_i = 0 \end{cases}
$$
then
$$
f(y_i, z_i \mid \mu_1, \sigma_1, \mu_2, \sigma_2, q) = f(y_i \mid \mu_1, \sigma_1, \mu_2, \sigma_2, z_i, q)\, f(z_i \mid q)
$$
Therefore the complete data likelihood is
$$
f(y, z \mid \mu_1, \sigma_1, \mu_2, \sigma_2, q)
= \prod_{i=1}^{n} \left\{ \frac{1}{\sqrt{2\pi}\,\sigma_1}\exp[-(y_i-\mu_1)^2/2\sigma_1^2] \right\}^{z_i}
\left\{ \frac{1}{\sqrt{2\pi}\,\sigma_2}\exp[-(y_i-\mu_2)^2/2\sigma_2^2] \right\}^{1-z_i}
q^{z_i} (1-q)^{1-z_i}
$$
$$
= \prod_{i=1}^{n} \left\{ q\,\frac{1}{\sqrt{2\pi}\,\sigma_1}\exp[-(y_i-\mu_1)^2/2\sigma_1^2] \right\}^{z_i}
\left\{ (1-q)\,\frac{1}{\sqrt{2\pi}\,\sigma_2}\exp[-(y_i-\mu_2)^2/2\sigma_2^2] \right\}^{1-z_i}.
$$
where $n = 272$ and $z_+ = \sum_{i=1}^{n} z_i$ = (number of observations in the left hump). That is,
$$
f(y, z \mid \mu_1, \mu_2, \sigma_1, \sigma_2, q) = \prod_{i=1}^{n} \{ q\, f_1(y_i \mid \mu_1, \sigma_1) \}^{z_i} \{ (1-q)\, f_2(y_i \mid \mu_2, \sigma_2) \}^{1-z_i}
$$
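Taking logs of this complete-data likelihood gives a plain sum over observations, with no log-of-a-sum as in the observed-data likelihood; that is what will make the Q function below easy to work with. A minimal sketch of the complete-data log-likelihood, assuming the data vector y and a 0/1 indicator vector z are available (loglik.comp is an illustrative name, not from the original code):

loglik.comp <- function(y, z, mu1, sigma1, mu2, sigma2, q) {  # log f(y, z | mu1, sigma1, mu2, sigma2, q)
  sum(z*(log(q) + dnorm(y, mu1, sigma1, log = TRUE)) +
      (1-z)*(log(1-q) + dnorm(y, mu2, sigma2, log = TRUE)))
}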
E-M Algorithm(s)
To derive the E-M algorithm for this data, we must compute $Q(\theta, \theta^k) = E[\log f(X \mid \theta) \mid Y, \theta^k]$, where
$\theta = (q, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2)$. Since
$$
\log f(y, z \mid \theta) = \sum_{i=1}^{n} \Big[ z_i \log q\, f_1(y_i \mid \mu_1, \sigma_1) + (1 - z_i) \log (1-q)\, f_2(y_i \mid \mu_2, \sigma_2) \Big]
$$
$$
= \sum_{i=1}^{n} \left[ \log\!\left( \frac{q\, f_1(y_i \mid \mu_1, \sigma_1)}{(1-q)\, f_2(y_i \mid \mu_2, \sigma_2)} \right) z_i + \log\big[ (1-q)\, f_2(y_i \mid \mu_2, \sigma_2) \big] \right]
$$
we know we will have to compute
$$
z_i^k = E[z_i \mid y_i, \theta^k] = P(z_i = 1 \mid y_i, \theta^k)
= \frac{f(z_i = 1, y_i \mid \theta^k)}{f(z_i = 1, y_i \mid \theta^k) + f(z_i = 0, y_i \mid \theta^k)}
= \frac{q^k f_1(y_i \mid \mu_1^k, \sigma_1^k)}{q^k f_1(y_i \mid \mu_1^k, \sigma_1^k) + (1 - q^k) f_2(y_i \mid \mu_2^k, \sigma_2^k)}
$$
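In R this conditional expectation is a single vectorized ratio. A sketch, assuming the normf density function and the current parameter vector th0 from the Implementation section below (it is exactly the Ez line that appears in the code there):

# E-step responsibilities: P(z_i = 1 | y_i, theta^k), one value per observation
zk <- th0[5]*normf(y, th0[1], th0[2]) /
      (th0[5]*normf(y, th0[1], th0[2]) + (1-th0[5])*normf(y, th0[3], th0[4]))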
This leads to
E-step: Compute
$$
Q(\theta, \theta^k) = E[\log f(y, z \mid \theta) \mid y, \theta^k]
= \sum_{i=1}^{n} \Big[ z_i^k \log q\, f_1(y_i \mid \mu_1, \sigma_1) + (1 - z_i^k) \log (1-q)\, f_2(y_i \mid \mu_2, \sigma_2) \Big]
$$
M-step: Compute $\theta^{k+1} = \arg\max_{\theta}\, Q(\theta, \theta^k)$
We can implement this E-M algorithm as it stands (see below), but a simplification is possible.
If we round each $z_i^k$ to zero or one, then $Q(\theta, \theta^k)$ looks like the sum of two normal log-likelihoods,
$$
Q(\theta, \theta^k) = \sum_{i:\, z_i^k = 1} \log q\, f_1(y_i \mid \mu_1, \sigma_1) \;+\; \sum_{i:\, z_i^k = 0} \log (1-q)\, f_2(y_i \mid \mu_2, \sigma_2)
$$
where $z_i^k = 1$ assigns some of the $y_i$'s to the left-hump likelihood and $z_i^k = 0$ assigns the rest of
the $y_i$'s to the right-hump likelihood. This suggests a modified E-M algorithm as follows:
E-step: Compute $z_i^k$, rounded to 0 or 1, as above.
M-step: Estimate
$$
n_1^k = \sum_{i=1}^{n} z_i^k \qquad\qquad n_2^k = \sum_{i=1}^{n} (1 - z_i^k)
$$
$$
\mu_1^{k+1} = \frac{1}{n_1^k} \sum_i z_i^k\, y_i \qquad\qquad \mu_2^{k+1} = \frac{1}{n_2^k} \sum_i (1 - z_i^k)\, y_i
$$
$$
\sigma_1^{2\,(k+1)} = \frac{1}{n_1^k} \sum_i z_i^k (y_i - \mu_1^{k+1})^2 \qquad\qquad \sigma_2^{2\,(k+1)} = \frac{1}{n_2^k} \sum_i (1 - z_i^k)(y_i - \mu_2^{k+1})^2
$$
$$
q^{k+1} = n_1^k / n
$$
That is, each mean and variance is computed from the $y_i$'s assigned to the corresponding hump, and
$q$ is the proportion of $y_i$'s assigned to the left hump.
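A well-known fact (not used in the code below) is that the full, unrounded E-M algorithm also has a closed-form M-step: replacing the 0/1 indicators with the fractional E-step weights $z_i^k$ gives weighted means, weighted variances, and an average weight that maximize $Q(\theta, \theta^k)$ exactly, so in principle the numerical maximization used in the implementation could be replaced by these updates. A sketch, where Ez denotes the vector of E-step weights and y the data (variable names are illustrative):

# Closed-form M-step with fractional weights Ez = E[z_i | y_i, theta^k]
n1  <- sum(Ez);            n2  <- sum(1 - Ez)
mu1 <- sum(Ez*y)/n1;       mu2 <- sum((1-Ez)*y)/n2
s1  <- sqrt(sum(Ez*(y - mu1)^2)/n1)
s2  <- sqrt(sum((1-Ez)*(y - mu2)^2)/n2)
q   <- n1/length(y)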
Implementation
Point Estimates
Full E-M
Following is an implementation of the full E-M algorithm. The results are essentially identical
to those of the simpler E-M algorithm that comes from rounding the $z_i^k$'s.
The histogram suggests that $(\mu_1 = 55,\ \sigma_1 = 5,\ \mu_2 = 80,\ \sigma_2 = 5,\ q = 0.5)$ might be a good set of
starting points. The following program stops when successive log-likelihood values are close enough
(relative change below $10^{-9}$).
library(MASS)
attach(faithful)
sqrt2PI <- sqrt(2*3.1415926535)
y <- waiting
### define th[1] = mu_1, th[2] = sigma_1,
### th[3] = mu_2, th[4] = sigma_2, th[5] = q
normf <- function(y, mu, sigma) {              # normal density
  1/(sqrt2PI*sigma)*exp(-(y-mu)^2/(2*sigma^2)) # (could use dnorm!)
}
lognormf <- function(y, mu, sigma) {           # log of normal density
  -(y-mu)^2/(2*sigma^2) - log(sqrt2PI*sigma)   # (could use dnorm(...,log=T)!)
}
llg <- function(th) {                          # log-likelihood
  sum( log(th[5]*normf(y, th[1], th[2]) +
           (1-th[5])*normf(y, th[3], th[4])) )
}
NQ <- function(th) {                           # E-step: Negative Q
  Ez <- th0[5]*normf(y,th0[1],th0[2]) /
        (th0[5]*normf(y,th0[1],th0[2]) +
         (1-th0[5])*normf(y,th0[3],th0[4]))
  -sum( Ez*(log(th[5])+lognormf(y,th[1],th[2])) +
        (1-Ez)*(log(1-th[5])+lognormf(y,th[3],th[4])) )
}
th0 <- c(60, 10, 90, 10, 0.5) # a good guess
# th0 <- c(100, 1, 30, 1, 0.9) # a wild guess
# th0 <- c(65, 10, 65, 10, 0.5) # leads to a local maximum/saddle point
i <- 0
while(1) {
  print(c(i, th0, llg(th0)))
  # th1 <- nlmin(NQ, th0)$x       # Splus: M-step: Minimize -Q
  th1 <- nlm(NQ, th0)$estimate    # R: M-step: Minimize -Q
  if( (i>1000) || (abs(llg(th0) - llg(th1)) < (1e-9)*abs(llg(th0))) )
    break
  th0 <- th1
  i <- i+1
}
There is some evidence of multiple maxima and/or saddle points for this likelihood (as we would
expect!). Here is a summary of the runs:
Starting Values           #iter   $\hat{\theta}$
(60, 10, 90, 10, 0.5)       27    (54.62, 5.87, 80.09, 5.87, 0.36)
(100, 1, 30, 1, 0.9)        15    (80.09, 5.87, 54.61, 5.87, 0.64)
(65, 10, 65, 10, 0.5)        1    (70.90, 13.57, 70.89, 13.57, 0.50)
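One way to sanity-check a converged fit is to overlay the estimated mixture density on the histogram; a sketch using the estimates from the first row above (hist with probability = TRUE and curve are base R):

th <- c(54.62, 5.87, 80.09, 5.87, 0.36)   # (mu1, sigma1, mu2, sigma2, q) from the first run
hist(waiting, probability = TRUE, breaks = 20)
curve(th[5]*dnorm(x, th[1], th[2]) + (1-th[5])*dnorm(x, th[3], th[4]), add = TRUE)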
Simplified E-M
Here is the code for the simplified E-M algorithm that we get by rounding the $z_i^k$'s:
th0 <- c(60, 10, 90, 10, 0.5) # a good guess
# th0 <- c(100, 1, 30, 1, 0.9) # a wild guess
# th0 <- c(65, 10, 65, 10, 0.5) # leads to a local maximum/saddle point
i <- 0
th1 <- th0
while(1) {
  print(c(i, th0, llg(th0)))
  Ez <- th0[5]*normf(y,th0[1],th0[2]) /
        (th0[5]*normf(y,th0[1],th0[2]) +
         (1-th0[5])*normf(y,th0[3],th0[4]))
  z <- round(Ez)                  # rounded E-step indicators
  n <- length(z)
  n0 <- sum(z)                    # number assigned to the left hump
  n1 <- n - n0                    # number assigned to the right hump
  th1[1] <- sum(z*y)/n0
  th1[3] <- sum((1-z)*y)/n1
  th1[2] <- sqrt(sum(z*(y-th1[1])^2)/n0)
  th1[4] <- sqrt(sum((1-z)*(y-th1[3])^2)/n1)
  th1[5] <- n0/n
  if( (i>1000) || (abs(llg(th0) - llg(th1)) < (1e-9)*abs(llg(th0))) )
    break
  th0 <- th1
  i <- i+1
}
and here are the results of running the full and the simplified E-M algorithm from each of the given starting
values:
                          Full E-M                                       Simplified E-M
Starting Values           #iter  $\hat{\theta}$                          #iter  $\hat{\theta}$
(60, 10, 90, 10, 0.5)       27   (54.62, 5.87, 80.09, 5.87, 0.36)          5   (54.75, 5.87, 80.28, 5.61, 0.37)
(100, 1, 30, 1, 0.9)        15   (80.09, 5.87, 54.61, 5.87, 0.64)          3   (80.21, 5.68, 54.63, 5.76, 0.64)
(65, 10, 65, 10, 0.5)        1   (70.90, 13.57, 70.89, 13.57, 0.50)        1   (NaN, NaN, 70.90, 13.57, 0.00)
Not only does the simplified E-M algorithm run in fewer iterations (why??), but if you run the
two algorithms yourself you will see that each step of the simplified algorithm takes much less
time than a step of the full E-M algorithm (why??). Since the results of the two algorithms are
different, only one can be producing MLEs... Which one is it? Why??
Aside: In many practical optimization problems you will have a choice of two algorithms, say,
algorithm A that is guaranteed to get to the right answer but may be unstable if the initial
guess is not good (or perhaps runs very slowly), and algorithm B that is more stable and/or faster,
but is not guaranteed to get to the right answer. A common strategy in such situations is to run
algorithm B to get into the ballpark of the correct answer, and then switch to algorithm A to polish
things up. In some problems, Newton-Raphson plays the role of A (fast but unstable), and E-M
plays the role of B (slow but more stable); or perhaps E-M plays the role of A (slow but guaranteed
to get to the right answer), and an approximate E-M algorithm plays the role of B (faster but not
guaranteed to get to the right answer). In other situations, there may be other choices for A and B.
Standard Errors
To calculate standard errors for the MLEs $(\mu_1 = 80.09,\ \sigma_1 = 5.87,\ \mu_2 = 54.62,\ \sigma_2 = 5.87,\ q = 0.64)$, we can use either the missing information principle or do numerical differentiation.
To compute standard errors using the Missing Information Principle,
$$
\hat{I}(\theta) = E\!\left[ -\frac{\partial^2}{\partial\theta^2} \log f(x \mid \theta) \,\Big|\, y, \theta \right] - \mathrm{Var}\!\left[ \frac{\partial}{\partial\theta} \log f(x \mid \theta) \,\Big|\, y, \theta \right], \qquad (**)
$$
we know that we only need the main diagonal terms of each of the two matrices on the right.
For the second term on the right-hand side of $(**)$, $\mathrm{Var}\!\left[ \frac{\partial}{\partial\theta} \log f(x \mid \theta) \mid y, \theta \right]$, we note that
$$
\frac{\partial}{\partial\theta} \log f(y, z \mid \theta) = \sum_{i=1}^{n} \left( z_i \frac{y_i - \mu_1}{\sigma_1^2},\;
z_i\!\left[ \frac{(y_i - \mu_1)^2}{\sigma_1^3} - \frac{1}{\sigma_1} \right],\;
(1 - z_i) \frac{y_i - \mu_2}{\sigma_2^2},\;
(1 - z_i)\!\left[ \frac{(y_i - \mu_2)^2}{\sigma_2^3} - \frac{1}{\sigma_2} \right],\;
\frac{z_i}{q} - \frac{1 - z_i}{1 - q} \right)^{T},
$$
and to compute the variance of each of these terms we will need the fact that $\mathrm{Var}(z_i \mid y_i, \theta) =
E(z_i \mid y_i, \theta)\,\big(1 - E(z_i \mid y_i, \theta)\big)$. Then the R/Splus code below calculates
$$
\mathrm{Var}\!\left[ \frac{\partial}{\partial\theta} \log f(y, z \mid \theta) \,\Big|\, y, \theta^k \right] = (0.3936528,\; 1.4835747,\; 0.5189960,\; 2.5355206,\; 58.9115844)^{T}
$$
Ez <- th0[5]*normf(y, th0[1], th0[2]) /
      (th0[5]*normf(y, th0[1], th0[2]) + (1-th0[5])*normf(y, th0[3], th0[4]))
Vz <- Ez*(1-Ez) # variance of binomial(1, p)
V <- c(sum(Vz*(y-th0[1])^2/th0[2]^4), sum(Vz*((y-th0[1])^2/th0[2]^3 - 1/th0[2])^2),
       sum(Vz*(y-th0[3])^2/th0[4]^4), sum(Vz*((y-th0[3])^2/th0[4]^3 - 1/th0[4])^2),
       sum(Vz*(1/th0[5]+1/(1-th0[5]))^2))
For the first term on the right-hand side of $(**)$, $E\!\left[ -\frac{\partial^2}{\partial\theta^2} \log f(x \mid \theta) \mid y, \theta \right]$, we note that the main
diagonal of $-\frac{\partial^2}{\partial\theta^2} \log f(y, z \mid \theta)$ is
$$
\sum_{i=1}^{n} \left( \frac{z_i}{\sigma_1^2},\;
z_i\!\left[ \frac{3(y_i - \mu_1)^2}{\sigma_1^4} - \frac{1}{\sigma_1^2} \right],\;
\frac{1 - z_i}{\sigma_2^2},\;
(1 - z_i)\!\left[ \frac{3(y_i - \mu_2)^2}{\sigma_2^4} - \frac{1}{\sigma_2^2} \right],\;
\frac{z_i}{q^2} + \frac{1 - z_i}{(1 - q)^2} \right)^{T}.
$$
The following code evaluates
$$
E\!\left[ -\frac{\partial^2}{\partial\theta^2} \log f(y, z \mid \theta) \,\Big|\, y, \theta^k \right] = (2.846707,\; 5.692247,\; 5.050176,\; 10.101635,\; 1179.206112)^{T}
$$
Ez <- th0[5]*normf(y, th0[1], th0[2]) /
      (th0[5]*normf(y, th0[1], th0[2]) + (1-th0[5])*normf(y, th0[3], th0[4]))
Vz <- Ez*(1-Ez) # variance of binomial(1, p)
E <- c(sum(Ez)/th0[2]^2,
       sum(Ez*(3*(y-th0[1])^2/(th0[2]^4)-1/(th0[2]^2))),
       sum(1-Ez)/th0[4]^2, sum((1-Ez)*(3*(y-th0[3])^2/(th0[4]^4)-1/(th0[4]^2))),
       sum(Ez)/th0[5]^2+sum(1-Ez)/(1-th0[5])^2 )
Now, calculating the standard errors using the missing information principle, $SE(\hat{\theta}) = \sqrt{1/(E - V)}$,
and using the infomat.s routine, is easy:
SE1 <- sqrt(1/(E-V)) # SE from Missing Information Principle
source("infomat.s") # numerical differentiator...
SE2 <- sqrt(1/diag(info.mtx.fun(llg, th0, hinit=rep(.001,5))))
# SE from infomat.s
> SE1
[1] 0.63847878 0.48744705 0.46977981 0.36354950 0.02987679
> SE2
[1] 0.63842116 0.48739376 0.46974987 0.36349582 0.02987575
We see that the two methods, SE1 = $SE_{\text{mip}}(\hat{\theta})$ and SE2 = $SE_{\text{numerical}}(\hat{\theta})$, give almost the same
result.
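The infomat.s routine itself is not listed in these notes. A rough stand-in for SE2 is to take central differences of llg directly, mirroring the hinit = .001 step used above (a sketch only, not the actual infomat.s code; it uses the llg and th0 objects defined earlier):

# Diagonal of the numerical observed information (negative Hessian of llg) at th0
h <- 1e-3
num.info <- sapply(1:5, function(j) {
  e <- rep(0, 5); e[j] <- h
  -(llg(th0 + e) - 2*llg(th0) + llg(th0 - e))/h^2
})
sqrt(1/num.info)   # should be close to SE1 and SE2 above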