primer

What is the expectation maximization algorithm?

Chuong B Do & Serafim Batzoglou

The expectation maximization algorithm arises in many computational biology applications that involve probabilistic models. What is it good for, and how does it work?

Chuong B. Do and Serafim Batzoglou are in the Computer Science Department, Stanford University, 318 Campus Drive, Stanford, California 94305-5428, USA. e-mail: chuong@cs.stanford.edu

Probabilistic models, such as hidden Markov models or Bayesian networks, are commonly used to model biological data. Much of their popularity can be attributed to the existence of efficient and robust procedures for learning parameters from observations. Often, however, the only data available for training a probabilistic model are incomplete. Missing values can occur, for example, in medical diagnosis, where patient histories generally include results from a limited battery of tests. Alternatively, in gene expression clustering, incomplete data arise from the intentional omission of gene-to-cluster assignments in the probabilistic model. The expectation maximization algorithm enables parameter estimation in probabilistic models with incomplete data.

A coin-flipping experiment
As an example, consider a simple coin-flipping experiment in which we are given a pair of coins A and B of unknown biases, θ_A and θ_B, respectively (that is, on any given flip, coin A will land on heads with probability θ_A and tails with probability 1 − θ_A, and similarly for coin B). Our goal is to estimate θ = (θ_A, θ_B) by repeating the following procedure five times: randomly choose one of the two coins (with equal probability), and perform ten independent coin tosses with the selected coin. Thus, the entire procedure involves a total of 50 coin tosses (Fig. 1a).

During our experiment, suppose that we keep track of two vectors x = (x_1, x_2, …, x_5) and z = (z_1, z_2, …, z_5), where x_i ∈ {0, 1, …, 10} is the number of heads observed during the ith set of tosses, and z_i ∈ {A, B} is the identity of the coin used during the ith set of tosses. Parameter estimation in this setting is known as the complete data case in that the values of all relevant random variables in our model (that is, the result of each coin flip and the type of coin used for each flip) are known.

Here, a simple way to estimate θ_A and θ_B is to return the observed proportions of heads for each coin:

θ̂_A = (# of heads using coin A) / (total # of flips using coin A)    (1)

and

θ̂_B = (# of heads using coin B) / (total # of flips using coin B).

This intuitive guess is, in fact, known in the statistical literature as maximum likelihood estimation (roughly speaking, the maximum likelihood method assesses the quality of a statistical model based on the probability it assigns to the observed data). If log P(x, z; θ) is the logarithm of the joint probability (or log-likelihood) of obtaining any particular vector of observed head counts x and coin types z, then the formulas in (1) solve for the parameters θ̂ = (θ̂_A, θ̂_B) that maximize log P(x, z; θ).

Now consider a more challenging variant of the parameter estimation problem in which we are given the recorded head counts x but not the identities z of the coins used for each set of tosses. We refer to z as hidden variables or latent factors. Parameter estimation in this new setting is known as the incomplete data case. This time, computing proportions of heads for each coin is no longer possible, because we don't know the coin used for each set of tosses. However, if we had some way of completing the data (in our case, guessing correctly which coin was used in each of the five sets), then we could reduce parameter estimation for this problem with incomplete data to maximum likelihood estimation with complete data.

One iterative scheme for obtaining completions could work as follows: starting from some initial parameters θ^(t) = (θ_A^(t), θ_B^(t)), determine for each of the five sets whether coin A or coin B was more likely to have generated the observed flips (using the current parameter estimates). Then, assume these completions (that is, guessed coin assignments) to be correct, and apply the regular maximum likelihood estimation procedure to get θ^(t+1). Finally, repeat these two steps until convergence. As the estimated model improves, so too will the quality of the resulting completions.

The expectation maximization algorithm is a refinement on this basic idea. Rather than picking the single most likely completion of the missing coin assignments on each iteration, the expectation maximization algorithm computes probabilities for each possible completion of the missing data, using the current parameters θ^(t). These probabilities are used to create a weighted training set consisting of all possible completions of the data. Finally, a modified version of maximum likelihood estimation that deals with weighted training examples provides new parameter estimates, θ^(t+1). By using weighted training examples rather than choosing the single best completion, the expectation maximization algorithm accounts for the confidence of the model in each completion of the data (Fig. 1b).
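To see where the completion weights in Figure 1b come from, it helps to write out the E-step for a single set of tosses. The calculation below is a worked example of our own (it is not spelled out in the article itself) and assumes each set of ten tosses is modeled as independent flips of the chosen coin; the numbers are the initial parameters θ_A^(0) = 0.60 and θ_B^(0) = 0.50 and the first row of Figure 1a, which has 5 heads and 5 tails:

```latex
P\bigl(z_1 = A \mid x_1 = 5;\, \theta^{(0)}\bigr)
  = \frac{(0.6)^5 (0.4)^5}{(0.6)^5 (0.4)^5 + (0.5)^5 (0.5)^5}
  \approx 0.45,
\qquad
P\bigl(z_1 = B \mid x_1 = 5;\, \theta^{(0)}\bigr) \approx 0.55.
```

The equal 1/2 prior on the two coins and the binomial coefficients cancel in the ratio. These two weights multiply the 5 observed heads and 5 observed tails, giving the expected counts of roughly 2.2 H, 2.2 T for coin A and 2.8 H, 2.8 T for coin B shown in the first row of the table in Figure 1b.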

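The full iteration is compact enough to sketch in code. The snippet below is a minimal illustration (ours, not from the article) of the E-step/M-step loop for the coin example, assuming a binomial model for each set of tosses; the head counts and starting parameters are read off Figure 1, and the function and variable names are illustrative.

```python
from math import comb

x = [5, 9, 8, 4, 7]              # heads observed in each of the 5 sets (Fig. 1a)
n = 10                           # tosses per set
theta_A, theta_B = 0.60, 0.50    # initial guesses, as in Fig. 1b

def likelihood(heads, theta):
    # Probability of observing `heads` heads in n independent tosses of a coin with bias theta.
    return comb(n, heads) * theta**heads * (1 - theta)**(n - heads)

for iteration in range(10):
    # E-step: expected head/tail counts for each coin under the current parameters.
    heads_A = tails_A = heads_B = tails_B = 0.0
    for heads in x:
        like_A = likelihood(heads, theta_A)
        like_B = likelihood(heads, theta_B)
        w_A = like_A / (like_A + like_B)   # equal prior on the two coins cancels
        w_B = 1.0 - w_A
        heads_A += w_A * heads
        tails_A += w_A * (n - heads)
        heads_B += w_B * heads
        tails_B += w_B * (n - heads)
    # M-step: equation (1) applied to the expected (weighted) counts.
    theta_A = heads_A / (heads_A + tails_A)
    theta_B = heads_B / (heads_B + tails_B)
    print(f"iteration {iteration + 1}: theta_A = {theta_A:.2f}, theta_B = {theta_B:.2f}")

# After one iteration this gives roughly theta_A = 0.71 and theta_B = 0.58, and after
# about ten iterations roughly 0.80 and 0.52, matching the values reported in Fig. 1b.
```

Replacing w_A with a hard 0/1 assignment of each set to its more probable coin recovers the simpler iterative scheme described above.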



In summary, the expectation maximization algorithm alternates between the steps of guessing a probability distribution over completions of missing data given the current model (known as the E-step) and then re-estimating the model parameters using these completions (known as the M-step). The name "E-step" comes from the fact that one does not usually need to form the probability distribution over completions explicitly, but rather need only compute "expected" sufficient statistics over these completions. Similarly, the name "M-step" comes from the fact that model reestimation can be thought of as "maximization" of the expected log-likelihood of the data.

Introduced as early as 1950 by Ceppellini et al.1 in the context of gene frequency estimation, the expectation maximization algorithm was analyzed more generally by Hartley2 and by Baum et al.3 in the context of hidden Markov models, where it is commonly known as the Baum-Welch algorithm. The standard reference on the expectation maximization algorithm and its convergence is Dempster et al.4

Figure 1 Parameter estimation for complete and incomplete data. (a) Maximum likelihood estimation. For each set of ten tosses, the maximum likelihood procedure accumulates the counts of heads and tails for coins A and B separately. These counts are then used to estimate the coin biases. (b) Expectation maximization. 1. EM starts with an initial guess of the parameters (here θ_A^(0) = 0.60, θ_B^(0) = 0.50). 2. In the E-step, a probability distribution over possible completions is computed using the current parameters. The counts shown in the table are the expected numbers of heads and tails according to this distribution. 3. In the M-step, new parameters are determined using the current completions (θ_A^(1) ≈ 0.71, θ_B^(1) ≈ 0.58). 4. After several repetitions of the E-step and M-step, the algorithm converges (θ_A^(10) ≈ 0.80, θ_B^(10) ≈ 0.52).

Mathematical foundations
How does the expectation maximization algorithm work? More importantly, why is it even necessary?

The expectation maximization algorithm is a natural generalization of maximum likelihood estimation to the incomplete data case. In particular, expectation maximization attempts to find the parameters θ̂ that maximize the log probability log P(x; θ) of the observed data. Generally speaking, the optimization problem addressed by the expectation maximization algorithm is more difficult than the optimization used in maximum likelihood estimation. In the complete data case, the objective function log P(x, z; θ) has a single global optimum, which can often be found in closed form (e.g., equation 1). In contrast, in the incomplete data case the function log P(x; θ) has multiple local maxima and no closed form solution.

To deal with this, the expectation maximization algorithm reduces the difficult task of optimizing log P(x; θ) into a sequence of simpler optimization subproblems, whose objective functions have unique global maxima that can often be computed in closed form. These subproblems are chosen in a way that guarantees their corresponding solutions θ^(1), θ^(2), … will converge to a local optimum of log P(x; θ).

More specifically, the expectation maximization algorithm alternates between two phases. During the E-step, expectation maximization chooses a function g_t that lower bounds log P(x; θ) everywhere, and for which g_t(θ^(t)) = log P(x; θ^(t)). During the M-step, the expectation maximization algorithm moves to a new parameter set θ^(t+1) that maximizes g_t. As the value of the lower bound g_t matches the objective function at θ^(t), it follows that log P(x; θ^(t)) = g_t(θ^(t)) ≤ g_t(θ^(t+1)) ≤ log P(x; θ^(t+1)), so the objective function monotonically increases during each iteration of expectation maximization! A graphical illustration of this argument is provided in Supplementary Figure 1 online, and a concise mathematical derivation of the expectation maximization algorithm is given in Supplementary Note 1 online.

As with most optimization methods for nonconcave functions, the expectation maximization algorithm comes with guarantees only of convergence to a local maximum of the objective function (except in degenerate cases). Running the procedure using multiple initial starting parameters is often helpful; similarly, initializing parameters in a way that breaks symmetry in models is also important. With this limited set of tricks, the expectation maximization algorithm provides a simple and robust tool for parameter estimation in models with incomplete data. In theory, other numerical optimization techniques, such as gradient descent or Newton-Raphson, could be used instead of expectation maximization; in practice, however, expectation maximization has the advantage of being simple, robust and easy to implement.
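The article leaves the explicit form of g_t to Supplementary Note 1, which is not reproduced here. For readers who want one concrete choice, the construction used in most textbook treatments (consistent with the two properties stated above, but an addition of ours rather than a quotation from the supplement) takes g_t to be the expected complete-data log-likelihood under the current posterior over completions, plus an entropy term:

```latex
g_t(\theta)
  \;=\; \sum_{z} P\bigl(z \mid x;\, \theta^{(t)}\bigr)\,
        \log \frac{P(x, z;\, \theta)}{P\bigl(z \mid x;\, \theta^{(t)}\bigr)}
  \;\le\; \log \sum_{z} P(x, z;\, \theta)
  \;=\; \log P(x;\, \theta).
```

The inequality is Jensen's inequality applied to the concave logarithm, and it holds with equality at θ = θ^(t), where the ratio inside the logarithm is constant and equal to P(x; θ^(t)). Because the denominator does not depend on θ, maximizing g_t in the M-step amounts to maximizing the expected complete-data log-likelihood, which for the coin example reduces to applying equation (1) to the expected head and tail counts in Figure 1b.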




Applications
Many probabilistic models in computational biology include latent variables. In some cases, these latent variables are present due to missing or corrupted data; in most applications of expectation maximization to computational biology, however, the latent factors are intentionally included, and parameter learning itself provides a mechanism for knowledge discovery.

For instance, in gene expression clustering5, we are given microarray gene expression measurements for thousands of genes under varying conditions, and our goal is to group the observed expression vectors into distinct clusters of related genes. One approach is to model the vector of expression measurements for each gene as being sampled from a multivariate Gaussian distribution (a generalization of a standard Gaussian distribution to multiple correlated variables) associated with that gene's cluster. In this case, the observed data x correspond to microarray measurements, the unobserved latent factors z are the assignments of genes to clusters, and the parameters θ include the means and covariance matrices of the multivariate Gaussian distributions representing the expression patterns for each cluster. For parameter learning, the expectation maximization algorithm alternates between computing probabilities for assignments of each gene to each cluster (E-step) and updating the cluster means and covariances based on the set of genes predominantly belonging to that cluster (M-step). This can be thought of as a "soft" variant of the popular k-means clustering algorithm, in which one alternates between "hard" assignments of genes to clusters and reestimation of cluster means based on their assigned genes.

In motif finding6, we are given a set of unaligned DNA sequences and asked to identify a pattern of length W that is present (though possibly with minor variations) in every sequence from the set. To apply the expectation maximization algorithm, we model the instance of the motif in each sequence as having each letter sampled independently from a position-specific distribution over letters, and the remaining letters in each sequence as coming from some fixed background distribution. The observed data x consist of the letters of the sequences, the unobserved latent factors z include the starting position of the motif in each sequence, and the parameters θ describe the position-specific letter frequencies for the motif. Here, the expectation maximization algorithm involves computing the probability distribution over motif start positions for each sequence (E-step) and updating the motif letter frequencies based on the expected letter counts for each position in the motif (M-step).

In the haplotype inference problem7, we are given the unphased genotypes of individuals from some population, where each unphased genotype consists of unordered pairs of single-nucleotide polymorphisms (SNPs) taken from homologous chromosomes of the individual. Contiguous blocks of SNPs inherited from a single chromosome are called haplotypes. Assuming for simplicity that each individual's genotype is a combination of two haplotypes (one maternal and one paternal), the goal of haplotype inference is to determine a small set of haplotypes that best explain all of the unphased genotypes observed in the population. Here, the observed data x are the known unphased genotypes for each individual, the unobserved latent factors z are putative assignments of unphased genotypes to pairs of haplotypes, and the parameters θ describe the frequencies of each haplotype in the population. The expectation maximization algorithm alternates between using the current haplotype frequencies to estimate probability distributions over phasing assignments for each unphased genotype (E-step) and using the expected phasing assignments to refine estimates of haplotype frequencies (M-step).

Other problems in which the expectation maximization algorithm plays a prominent role include learning profiles of protein domains8 and RNA families9, discovery of transcriptional modules10, tests of linkage disequilibrium11, protein identification12 and medical imaging13.

In each case, expectation maximization provides a simple, easy-to-implement and efficient tool for learning parameters of a model; once these parameters are known, we can use probabilistic inference to ask interesting queries about the model. For example, what cluster does a particular gene most likely belong to? What is the most likely starting location of a motif in a particular sequence? What are the most likely haplotype blocks making up the genotype of a specific individual? By providing a straightforward mechanism for parameter learning in all of these models, expectation maximization provides a mechanism for building and training rich probabilistic models for biological applications.

Note: Supplementary information is available on the Nature Biotechnology website.

ACKNOWLEDGMENTS
C.B.D. is supported in part by a National Science Foundation (NSF) Graduate Fellowship. S.B. wishes to acknowledge support by the NSF CAREER Award. We thank four anonymous referees for helpful suggestions.

1. Ceppellini, R., Siniscalco, M. & Smith, C.A. Ann. Hum. Genet. 20, 97–115 (1955).
2. Hartley, H. Biometrics 14, 174–194 (1958).
3. Baum, L.E., Petrie, T., Soules, G. & Weiss, N. Ann. Math. Stat. 41, 164–171 (1970).
4. Dempster, A.P., Laird, N.M. & Rubin, D.B. J. R. Stat. Soc. Ser. B 39, 1–38 (1977).
5. D'haeseleer, P. Nat. Biotechnol. 23, 1499–1501 (2005).
6. Lawrence, C.E. & Reilly, A.A. Proteins 7, 41–51 (1990).
7. Excoffier, L. & Slatkin, M. Mol. Biol. Evol. 12, 921–927 (1995).
8. Krogh, A., Brown, M., Mian, I.S., Sjölander, K. & Haussler, D. J. Mol. Biol. 235, 1501–1543 (1994).
9. Eddy, S.R. & Durbin, R. Nucleic Acids Res. 22, 2079–2088 (1994).
10. Segal, E., Yelensky, R. & Koller, D. Bioinformatics 19, i273–i282 (2003).
11. Slatkin, M. & Excoffier, L. Heredity 76, 377–383 (1996).
12. Nesvizhskii, A.I., Keller, A., Kolker, E. & Aebersold, R. Anal. Chem. 75, 4646–4658 (2003).
13. De Pierro, A.R. IEEE Trans. Med. Imaging 14, 132–137 (1995).
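As a concrete companion to the gene expression clustering application described above, the sketch below shows one way the E-step/M-step alternation for a Gaussian mixture could look in code. It is an illustrative simplification, not part of the original article: it assumes unit-variance spherical clusters and synthetic data, whereas the formulation above also re-estimates full covariance matrices, and all names and the toy data are ours.

```python
# Schematic EM for soft clustering of expression vectors into K Gaussian clusters.
# Simplifying assumption (ours): unit-variance spherical clusters, so only the
# cluster means and mixing proportions are re-estimated. Toy data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))      # toy "expression vectors": 100 genes x 3 conditions
K = 2                              # number of clusters
means = X[rng.choice(len(X), size=K, replace=False)]  # symmetry-breaking initialization
weights = np.full(K, 1.0 / K)      # cluster mixing proportions

for _ in range(25):
    # E-step: posterior probability (responsibility) of each cluster for each gene.
    sq_dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    log_resp = np.log(weights) - 0.5 * sq_dist       # log of (prior x Gaussian kernel)
    log_resp -= log_resp.max(axis=1, keepdims=True)  # for numerical stability
    resp = np.exp(log_resp)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: weighted re-estimation of cluster means and mixing proportions.
    Nk = resp.sum(axis=0)
    means = (resp.T @ X) / Nk[:, None]
    weights = Nk / len(X)
```

Replacing the responsibilities with hard 0/1 assignments to the nearest mean turns this loop into standard k-means, which is exactly the "soft" versus "hard" distinction drawn above.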

