Expectation Maximization

Expectation-Maximization (EM) is a general technique for maximum likelihood learning in the presence of counfounding (aka hidden, nuisance) variables.  Consider fiting a distribution p(a,b) of two variables.  However, we want to fit p to make p(a) accurate, not the joint distribution p(a,b).  (There are several reasons we might want to do this– the most common is that we only have training data for a.)  So, write down the log-likelihood for some data element \hat{a}.

L({\hat a}) = \log p({\hat a})

= \log( \sum_b p({\hat a},b))

= \log( \sum_b r(b|{\hat a}) p({\hat a},b)/r(b|{\hat a}) )

In the last step, r(b|{\hat a}) is any valid distribution over b.  (The reason for introducing this will be clear shortly.)

Now, recall Jensen’s inequality.  For any distribution q(x), and concave function f(x),

f(\sum_x q(x) x) \geq \sum_x q(x) f(x).

Applying this to the last expression we had for L({\hat x}), we get a lower bound.

L({\hat a}) \geq Q({\hat a})

Q({\hat a}) = \sum_b r(b|{\hat a}) ( \log p({\hat a},b) - \log r(b|{\hat a}))

The basic idea of EM is very simple.  Instead of directly trying to maximize L, instead maximize the lower bound, Q.  This is accomplished in two steps.

  • Maximize Q with respect to r.  This is called the “Expectation” (E) step.
  • Maximize Q with respect to p.  This is called the “Maximization” (M) step.

One performs the M-step basically like normal (joint) maximum likelihood learning.  That is, one fits p to maximize

\sum_{\hat a} \sum_b r(b|{\hat a}) \log p({\hat a},b),

which is basically just weighted maximum likelihood learning.

The E-step needs a bit more explanation.  (Why is it called the “expectation” step?)  It can be shown, by setting up a Lagrange multiplier problem, that the optimal r will in fact be r(b|{\hat a}) = p(b|{\hat a}).  Moreover, it is easy to show that, after substituting this value of r, Q({\hat a})=L({\hat a})!  Of course, once we start messing around with p in the M step while holding r constant, this will no longer be true, but we can see that at convergence, the bound will be tight.  This, EM truly maximizes L, not an approximation to it.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s