What is the expectation maximization algorithm?
Chuong B Do & Serafim Batzoglou
The expectation maximization algorithm arises in many computational biology applications that involve probabilistic
models. What is it good for, and how does it work?
In summary, the expectation maximization algorithm alternates between the steps of guessing a probability distribution over completions of missing data given the current model (known as the E-step) and then reestimating the model parameters using these completions (known as the M-step). The name E-step comes from the fact that one does not usually need to form the probability distribution over completions explicitly, but rather need only compute "expected" sufficient statistics over these completions. Similarly, the name M-step comes from the fact that model reestimation can be thought of as "maximization" of the expected log-likelihood of the data.

Introduced as early as the 1950s by Ceppellini et al.1 in the context of gene frequency estimation, the expectation maximization algorithm was analyzed more generally by Hartley2 and by Baum et al.3 in the context of hidden Markov models, where it is commonly known as the Baum-Welch algorithm. The standard reference on the expectation maximization algorithm and its convergence is Dempster et al.4

Mathematical foundations
How does the expectation maximization algorithm work? More importantly, why is it even necessary?

The expectation maximization algorithm is a natural generalization of maximum likelihood estimation to the incomplete data case. In particular, expectation maximization attempts to find the parameters θ̂ that maximize the log probability log P(x; θ) of the observed data. Generally speaking, the optimization problem addressed by the expectation maximization algorithm is more difficult than the optimization used in maximum likelihood estimation. In the complete data case, the objective function log P(x, z; θ) has a single global optimum, which can often be found in closed form (e.g., equation 1). In contrast, in the incomplete data case the function log P(x; θ) has multiple local maxima and no closed form solution.
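To make the contrast concrete, here is a minimal Python sketch of the complete data case from Figure 1a, in which the coin used for each set of tosses is observed. The toss sequences and coin labels are read off the figure; the variable names are our own. The maximum likelihood estimates are simple ratios of counts, available in closed form:

sets = [
    ("B", "HTTTHHTHTH"),  # 5 heads, 5 tails
    ("A", "HHHHTHHHHH"),  # 9 heads, 1 tail
    ("A", "HTHHHHHTHH"),  # 8 heads, 2 tails
    ("B", "HTHTTTHHTT"),  # 4 heads, 6 tails
    ("A", "THHHTHHHTH"),  # 7 heads, 3 tails
]

for coin in ("A", "B"):
    # Closed-form maximum likelihood estimate: the fraction of heads
    # among all tosses attributed to this coin.
    tosses = "".join(seq for label, seq in sets if label == coin)
    print(coin, tosses.count("H") / len(tosses))  # A: 0.80, B: 0.45

No comparably simple formula exists once the coin labels are hidden.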
To deal with this, the expectation maximization algorithm reduces the difficult task of optimizing log P(x; θ) into a sequence of simpler optimization subproblems, whose objective functions have unique global maxima that can often be computed in closed form. These subproblems are chosen in a way that guarantees their corresponding solutions θ̂(1), θ̂(2) and so on will converge to a local optimum of log P(x; θ).

More specifically, the expectation maximization algorithm alternates between two phases. During the E-step, expectation maximization chooses a function gt that lower bounds log P(x; θ) everywhere, and for which gt(θ(t)) = log P(x; θ(t)). During the M-step, the expectation maximization algorithm moves to a new parameter set θ(t+1) that maximizes gt. As the value of the lower bound gt matches the objective function at θ(t), it follows that

log P(x; θ(t)) = gt(θ(t)) ≤ gt(θ(t+1)) = log P(x; θ(t+1)),

so the objective function monotonically increases during each iteration of expectation maximization! A graphical illustration of this argument is provided in Supplementary Figure 1 online, and a concise mathematical derivation of the expectation maximization algorithm is given in Supplementary Note 1 online.

As with most optimization methods for nonconcave functions, the expectation maximization algorithm comes with guarantees only of convergence to a local maximum of the objective function (except in degenerate cases). Running the procedure using multiple initial starting parameters is often helpful; similarly, initializing parameters in a way that breaks symmetry in models is also important. With this limited set of tricks, the expectation maximization algorithm provides a simple and robust tool for parameter estimation in models with incomplete data. In theory, other numerical optimization techniques, such as gradient descent or Newton-Raphson, could be used instead of expectation maximization; in practice, however, expectation maximization has the advantage of being simple, robust and easy to implement.

Figure 1 Parameter estimation for complete and incomplete data. (a) Maximum likelihood estimation. Five sets of ten coin tosses are shown, each produced by coin A or coin B. For each set, the maximum likelihood procedure accumulates the counts of heads and tails for coins A and B separately. These counts are then used to estimate the coin biases: θ̂A = 24/(24 + 6) = 0.80 and θ̂B = 9/(9 + 11) = 0.45. (b) Expectation maximization. 1. EM starts with an initial guess of the parameters, θ̂A(0) = 0.60 and θ̂B(0) = 0.50. 2. In the E-step, a probability distribution over possible completions is computed using the current parameters. The counts shown in the table are the expected numbers of heads and tails according to this distribution (21.3 heads and 8.6 tails for coin A; 11.7 heads and 8.4 tails for coin B). 3. In the M-step, new parameters are determined using the current completions: θ̂A(1) = 21.3/(21.3 + 8.6) ≈ 0.71 and θ̂B(1) = 11.7/(11.7 + 8.4) ≈ 0.58. 4. After several repetitions of the E-step and M-step, the algorithm converges (here, θ̂A(10) ≈ 0.80 and θ̂B(10) ≈ 0.52 after ten iterations).
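The two-coin experiment of Figure 1b is small enough to implement in full. The following Python sketch is a toy reimplementation of that procedure, not code from the original article; the function names and the assumption of a uniform 0.5 prior over which coin was chosen for each set are ours. It also tracks log P(x; θ) across iterations to illustrate the monotonic increase argued above.

from math import comb, log

def likelihood(theta, heads, tails):
    # Binomial probability of a set of tosses given coin bias theta.
    return comb(heads + tails, heads) * theta**heads * (1 - theta)**tails

def log_prob(sets, theta_a, theta_b):
    # log P(x; theta), marginalizing the hidden coin choice for each set
    # (each set assumed to come from coin A or B with probability 0.5).
    return sum(log(0.5 * likelihood(theta_a, s.count("H"), s.count("T")) +
                   0.5 * likelihood(theta_b, s.count("H"), s.count("T")))
               for s in sets)

def em(sets, theta_a, theta_b, iterations=10):
    prev = log_prob(sets, theta_a, theta_b)
    for _ in range(iterations):
        # E-step: expected head/tail counts for each coin, weighted by the
        # posterior probability that the coin produced each set of tosses.
        heads_a = tails_a = heads_b = tails_b = 0.0
        for s in sets:
            h, t = s.count("H"), s.count("T")
            la = likelihood(theta_a, h, t)
            lb = likelihood(theta_b, h, t)
            pa = la / (la + lb)  # P(coin A | tosses, current parameters)
            heads_a += pa * h
            tails_a += pa * t
            heads_b += (1 - pa) * h
            tails_b += (1 - pa) * t
        # M-step: re-estimate each coin's bias from its expected counts.
        theta_a = heads_a / (heads_a + tails_a)
        theta_b = heads_b / (heads_b + tails_b)
        # The observed-data log-likelihood never decreases across iterations.
        cur = log_prob(sets, theta_a, theta_b)
        assert cur >= prev - 1e-12
        prev = cur
    return theta_a, theta_b

sets = ["HTTTHHTHTH", "HHHHTHHHHH", "HTHHHHHTHH", "HTHTTTHHTT", "THHHTHHHTH"]
print(em(sets, 0.60, 0.50))  # approximately (0.80, 0.52), as in Figure 1b

Starting from θ̂A(0) = 0.60 and θ̂B(0) = 0.50, the first iteration reproduces the expected counts shown in the figure (21.3 heads and 8.6 tails attributed to coin A; 11.7 heads and 8.4 tails to coin B), and ten iterations give roughly (0.80, 0.52).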
Applications
Many probabilistic models in computational biology include latent variables. In some cases, these latent variables are present due to missing or corrupted data; in most applications of expectation maximization to computational biology, however, the latent factors are intentionally included, and parameter learning itself provides a mechanism for knowledge discovery.

For instance, in gene expression clustering5, we are given microarray gene expression measurements for thousands of genes under varying conditions, and our goal is to group the observed expression vectors into distinct clusters of related genes. One approach is to model the vector of expression measurements for each gene as being sampled from a multivariate Gaussian distribution (a generalization of a standard Gaussian distribution to multiple correlated variables) associated with that gene's cluster. In this case, the observed data x correspond to microarray measurements, the unobserved latent factors z are the assignments of genes to clusters, and the parameters θ include the means and covariance matrices of the multivariate Gaussian distributions representing the expression patterns for each cluster. For parameter learning, the expectation maximization algorithm alternates between computing probabilities for assignments of each gene to each cluster (E-step) and updating the cluster means and covariances based on the set of genes predominantly belonging to that cluster (M-step). This can be thought of as a "soft" variant of the popular k-means clustering algorithm, in which one alternates between "hard" assignments of genes to clusters and reestimation of cluster means based on their assigned genes.
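As a sketch of what the E-step and M-step look like for this mixture-of-Gaussians model, here is a minimal Python/NumPy implementation. It is an illustrative toy rather than code from the clustering methods cited above; the function name gmm_em, the use of scipy.stats.multivariate_normal and the small regularization term added to each covariance matrix are our own choices.

import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(x, k, n_iter=50, seed=0):
    # x: (n_genes, n_conditions) expression matrix; k: number of clusters.
    rng = np.random.default_rng(seed)
    n, d = x.shape
    # Initialize means at k distinct genes, which breaks the symmetry
    # between clusters.
    means = x[rng.choice(n, size=k, replace=False)].copy()
    covs = np.array([np.cov(x.T) + 1e-6 * np.eye(d) for _ in range(k)])
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: posterior probability of each cluster for each gene.
        resp = np.column_stack([
            weights[j] * multivariate_normal(means[j], covs[j]).pdf(x)
            for j in range(k)])
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update the mixture weights, means and covariances from
        # the soft assignments (their expected sufficient statistics).
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp.T @ x) / nk[:, None]
        for j in range(k):
            diff = x - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)
    return weights, means, covs, resp

Replacing the soft assignments in resp with hard 0/1 assignments to the nearest mean recovers k-means; after convergence, resp.argmax(axis=1) gives each gene's most likely cluster.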
In motif finding6, we are given a set of unaligned DNA sequences and asked to identify a pattern of length W that is present (though possibly with minor variations) in every sequence from the set. To apply the expectation maximization algorithm, we model the instance of the motif in each sequence as having each letter sampled independently from a position-specific distribution over letters, and the remaining letters in each sequence as coming from some fixed background distribution. The observed data x consist of the letters of sequences, the unobserved latent factors z include the starting position of the motif in each sequence, and the parameters θ describe the position-specific letter frequencies for the motif. Here, the expectation maximization algorithm involves computing the probability distribution over motif start positions for each sequence (E-step) and updating the motif letter frequencies based on the expected letter counts for each position in the motif (M-step).
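The following Python sketch gives a toy version of this motif-finding EM, assuming one motif occurrence per sequence, a uniform background distribution and a uniform prior over start positions; the function name motif_em, the random Dirichlet initialization and the pseudocounts are our own choices rather than details from reference 6.

import numpy as np

BASES = "ACGT"

def motif_em(seqs, w, n_iter=100, seed=0):
    # seqs: DNA strings, each assumed to contain one occurrence of a
    # width-w motif at an unknown start position.
    rng = np.random.default_rng(seed)
    pwm = rng.dirichlet(np.ones(4), size=w)  # random start breaks symmetry
    background = np.full(4, 0.25)            # fixed uniform background
    encoded = [[BASES.index(c) for c in s] for s in seqs]
    for _ in range(n_iter):
        counts = np.full((w, 4), 0.01)  # pseudocounts avoid zero frequencies
        for s in encoded:
            n_starts = len(s) - w + 1
            # E-step: posterior over start positions; background factors
            # outside the motif window cancel in the likelihood ratio.
            like = np.array([
                np.prod([pwm[j, s[i + j]] / background[s[i + j]]
                         for j in range(w)])
                for i in range(n_starts)])
            post = like / like.sum()
            # Accumulate expected letter counts for each motif column.
            for i, p in enumerate(post):
                for j in range(w):
                    counts[j, s[i + j]] += p
        # M-step: renormalize the expected counts into new frequencies.
        pwm = counts / counts.sum(axis=1, keepdims=True)
    return pwm

After convergence, the posterior computed in the E-step for a given sequence answers the query of where its motif most likely starts.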
In the haplotype inference problem7, we are given the unphased genotypes of individuals from some population, where each unphased genotype consists of unordered pairs of single-nucleotide polymorphisms (SNPs) taken from homologous chromosomes of the individual. Contiguous blocks of SNPs inherited from a single chromosome are called haplotypes. Assuming for simplicity that each individual's genotype is a combination of two haplotypes (one maternal and one paternal), the goal of haplotype inference is to determine a small set of haplotypes that best explain all of the unphased genotypes observed in the population. Here, the observed data x are the known unphased genotypes for each individual, the unobserved latent factors z are putative assignments of unphased genotypes to pairs of haplotypes, and the parameters θ describe the frequencies of each haplotype in the population. The expectation maximization algorithm alternates between using the current haplotype frequencies to estimate probability distributions over phasing assignments for each unphased genotype (E-step) and using the expected phasing assignments to refine estimates of haplotype frequencies (M-step).
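A toy Python version of this haplotype-frequency EM, in the spirit of the approach in reference 7, is sketched below. The 0/1/2 genotype encoding (homozygous reference, heterozygous, homozygous alternate per SNP), the function names and the uniform initialization are our own assumptions, and the explicit enumeration of phasings grows exponentially with the number of heterozygous sites, so the sketch is only practical for short blocks.

from collections import defaultdict
from itertools import product

def compatible_pairs(genotype):
    # All ordered haplotype pairs consistent with an unphased genotype,
    # coded per SNP as 0 (hom. reference), 1 (heterozygous) or 2 (hom.
    # alternate); haplotypes are tuples of 0/1 alleles.
    options = [[(0, 0)] if g == 0 else [(1, 1)] if g == 2 else [(0, 1), (1, 0)]
               for g in genotype]
    for combo in product(*options):
        yield tuple(a for a, _ in combo), tuple(b for _, b in combo)

def haplotype_em(genotypes, n_iter=50):
    # Initialize with uniform frequencies over every haplotype that is
    # compatible with at least one genotype in the sample.
    seen = {h for g in genotypes for pair in compatible_pairs(g) for h in pair}
    freqs = {h: 1.0 / len(seen) for h in seen}
    for _ in range(n_iter):
        counts = defaultdict(float)
        for g in genotypes:
            pairs = list(compatible_pairs(g))
            # E-step: posterior of each phasing under the current
            # frequencies (random pairing of haplotypes in the population).
            probs = [freqs.get(h1, 0.0) * freqs.get(h2, 0.0)
                     for h1, h2 in pairs]
            z = sum(probs)
            for (h1, h2), p in zip(pairs, probs):
                counts[h1] += p / z
                counts[h2] += p / z
        # M-step: expected haplotype counts, normalized to frequencies
        # (each individual contributes two haplotypes).
        freqs = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
    return freqs

print(haplotype_em([(1, 1), (0, 1), (1, 0)]))  # tiny two-SNP example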
Other problems in which the expectation maximization algorithm plays a prominent role include learning profiles of protein domains8 and RNA families9, discovery of transcriptional modules10, tests of linkage disequilibrium11, protein identification12 and medical imaging13.

In each case, expectation maximization provides a simple, easy-to-implement and efficient tool for learning parameters of a model; once these parameters are known, we can use probabilistic inference to ask interesting queries about the model. For example, what cluster does a particular gene most likely belong to? What is the most likely starting location of a motif in a particular sequence? What are the most likely haplotype blocks making up the genotype of a specific individual? By providing a straightforward mechanism for parameter learning in all of these models, expectation maximization provides a mechanism for building and training rich probabilistic models for biological applications.

Note: Supplementary information is available on the Nature Biotechnology website.

ACKNOWLEDGMENTS
C.B.D. is supported in part by a National Science Foundation (NSF) Graduate Fellowship. S.B. wishes to acknowledge support by the NSF CAREER Award. We thank four anonymous referees for helpful suggestions.

1. Ceppellini, R., Siniscalco, M. & Smith, C.A. Ann. Hum. Genet. 20, 97–115 (1955).
2. Hartley, H. Biometrics 14, 174–194 (1958).
3. Baum, L.E., Petrie, T., Soules, G. & Weiss, N. Ann. Math. Stat. 41, 164–171 (1970).
4. Dempster, A.P., Laird, N.M. & Rubin, D.B. J. R. Stat. Soc. Ser. B 39, 1–38 (1977).
5. D'haeseleer, P. Nat. Biotechnol. 23, 1499–1501 (2005).
6. Lawrence, C.E. & Reilly, A.A. Proteins 7, 41–51 (1990).
7. Excoffier, L. & Slatkin, M. Mol. Biol. Evol. 12, 921–927 (1995).
8. Krogh, A., Brown, M., Mian, I.S., Sjölander, K. & Haussler, D. J. Mol. Biol. 235, 1501–1543 (1994).
9. Eddy, S.R. & Durbin, R. Nucleic Acids Res. 22, 2079–2088 (1994).
10. Segal, E., Yelensky, R. & Koller, D. Bioinformatics 19, i273–i282 (2003).
11. Slatkin, M. & Excoffier, L. Heredity 76, 377–383 (1996).
12. Nesvizhskii, A.I., Keller, A., Kolker, E. & Aebersold, R. Anal. Chem. 75, 4646–4658 (2003).
13. De Pierro, A.R. IEEE Trans. Med. Imaging 14, 132–137 (1995).