Expectation-Maximization (EM) is a general technique for maximum likelihood learning in the presence of confounding (aka hidden, nuisance) variables. Consider fitting a distribution $p(x,z;\theta)$ of two variables $x$ and $z$. However, we want to fit $\theta$ to make the marginal $p(x;\theta)$ accurate, not the joint distribution $p(x,z;\theta)$. (There are several reasons we might want to do this; the most common is that we only have training data for $x$.) So, write down the log-likelihood for some data element $x$:

$$\log p(x;\theta) = \log \sum_z p(x,z;\theta) = \log \sum_z q(z) \frac{p(x,z;\theta)}{q(z)}.$$
In the last step, $q$ is any valid distribution over $z$. (The reason for introducing this will become clear shortly.)
Now, recall Jensen's inequality. For any distribution $q$ and concave function $f$,

$$f\Bigl(\sum_z q(z)\, g(z)\Bigr) \ge \sum_z q(z)\, f\bigl(g(z)\bigr).$$
Applying this to the last expression we had for $\log p(x;\theta)$, with $f=\log$, we get a lower bound:

$$\log p(x;\theta) \ge \sum_z q(z) \log \frac{p(x,z;\theta)}{q(z)}.$$
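To make the inequality concrete, here is a small numerical check on a toy joint distribution (the numbers are made up for illustration only). Every choice of $q$ gives a value no larger than the true log-likelihood:

```python
import math

# Toy joint p(x, z) for one fixed observed x, with hidden z in {0, 1}.
# These numbers are illustrative, not from the text.
p_joint = {0: 0.3, 1: 0.2}
log_px = math.log(sum(p_joint.values()))  # log p(x) = log 0.5

# Evaluate the Jensen lower bound for a few choices of q(z).
for q0 in (0.1, 0.5, 0.9):
    q = {0: q0, 1: 1.0 - q0}
    bound = sum(q[z] * math.log(p_joint[z] / q[z]) for z in q)
    print(f"q(0)={q0:.1f}  bound={bound:.4f}  log p(x)={log_px:.4f}")
    assert bound <= log_px + 1e-12  # Jensen: bound never exceeds log p(x)
```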
The basic idea of EM is very simple. Instead of directly trying to maximize $\log p(x;\theta)$, maximize the lower bound

$$L(q,\theta) = \sum_z q(z) \log \frac{p(x,z;\theta)}{q(z)}.$$

This is accomplished by alternating between two steps.
- Maximize $L(q,\theta)$ with respect to $q$. This is called the “Expectation” (E) step.
- Maximize $L(q,\theta)$ with respect to $\theta$. This is called the “Maximization” (M) step.
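The alternation above can be sketched on a concrete model. Here is a minimal EM loop for a 1-D mixture of two Gaussians with unit variance and equal mixing weights; the data, initialization, and function names are all illustrative assumptions, not from the text:

```python
import math
import random

# Illustrative data: two well-separated clusters with hidden assignments z.
random.seed(0)
xs = [random.gauss(-2, 1) for _ in range(200)] + \
     [random.gauss(3, 1) for _ in range(200)]

def normal_pdf(x, mu):
    """Density of N(mu, 1) at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

mu = [-1.0, 1.0]  # initial guesses for the two component means
for _ in range(50):
    # E-step: set q(z | x) to the posterior responsibilities p(z | x; theta).
    resp = []
    for x in xs:
        p0, p1 = normal_pdf(x, mu[0]), normal_pdf(x, mu[1])
        resp.append(p0 / (p0 + p1))
    # M-step: weighted maximum likelihood update of each mean.
    mu[0] = sum(r * x for r, x in zip(resp, xs)) / sum(resp)
    mu[1] = sum((1 - r) * x for r, x in zip(resp, xs)) / sum(1 - r for r in resp)

print(sorted(mu))  # the means converge toward the clusters near -2 and 3
```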
One performs the M-step basically like normal (joint) maximum likelihood learning. That is, since the entropy term $-\sum_z q(z)\log q(z)$ in $L(q,\theta)$ does not depend on $\theta$, one fits $\theta$ to maximize

$$\sum_z q(z) \log p(x,z;\theta),$$
which is basically just weighted maximum likelihood learning.
The E-step needs a bit more explanation. (Why is it called the “expectation” step? Because it amounts to computing expectations under the posterior over $z$.) It can be shown, by setting up a Lagrange multiplier problem, that the optimal $q$ will in fact be the posterior $q(z) = p(z|x;\theta)$. Moreover, it is easy to show that, after substituting this value of $q$, $L(q,\theta) = \log p(x;\theta)$! Of course, once we start messing around with $\theta$ in the M step while holding $q$ constant, this will no longer be true, but we can see that at convergence, the bound will be tight. Thus, EM truly maximizes $\log p(x;\theta)$, not an approximation to it.
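The tightness claim is easy to verify numerically on a made-up toy joint: plugging the posterior $p(z|x;\theta)$ in for $q$ makes the lower bound equal $\log p(x;\theta)$ exactly.

```python
import math

# Toy joint p(x, z; theta) for one observed x; the numbers are illustrative.
p_joint = {0: 0.1, 1: 0.4}
px = sum(p_joint.values())  # marginal p(x)

# E-step choice: q(z) = p(z | x) = p(x, z) / p(x).
q = {z: p_joint[z] / px for z in p_joint}
bound = sum(q[z] * math.log(p_joint[z] / q[z]) for z in q)

# With the posterior, the Jensen bound is tight: L(q, theta) = log p(x).
assert abs(bound - math.log(px)) < 1e-12
```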