Personal opinions about graphical models 1: The surrogate likelihood exists and you should use it.

When talking about graphical models with people  (particularly computer vision folks) I find myself advancing a few opinions over and over again.  So, in an effort to stop bothering people at conferences, I thought I’d write a few entries here.

The first thing I’d like to discuss is “surrogate likelihood” training.  (So far as I know, Martin Wainwright was the first person to give a name to this method.)

Background

Suppose we want to fit a Markov random field (MRF).  I’m writing this as a generative model with an MRF for simplicity– pretty much the same story holds with a Conditional Random Field in the discriminative setting.

p({\bf x}) = \frac{1}{Z} \prod_{c} \psi({\bf x}_c) \prod_i \psi(y_i)

Here, the first product is over all cliques/factors in the graph, and the second is over all single variables.  Now, it is convenient to note that MRFs can be seen as members of the exponential family

p({\bf x};{\boldsymbol \theta}) = \exp( {\boldsymbol \theta} \cdot {\bf f}({\bf x}) - A({\boldsymbol \theta}) ),

where

{\bf f}({\bf X})=\{I[{\bf X}_{c}={\bf x}_{c}]|\forall c,{\bf x}_{c}\}\cup\{I[X_{i}=x_{i}]|\forall i,x_{i} \}

is a function consisting of indicator functions for each possible configuration of each clique and variable, and the log-partition function

A(\boldsymbol{\theta})=\log\sum_{{\bf x}}\exp\boldsymbol{\theta}\cdot{\bf f}({\bf x}).

ensures normalization.

Now, the log-partition function has the very important (and easy to show) property that the gradient is the expected value of \bf f.

\displaystyle \frac{dA}{d{\boldsymbol \theta}} = \sum_{\bf x} p({\bf x};{\boldsymbol \theta}) {\bf f}({\bf x})

With a graphical model, what does this mean?  Well, notice that the expected value of, say, I[X_i=x_i] will be exactly p(x_i;{\boldsymbol \theta}). Thus, the expected value of {\bf f} will be a vector containing all univariate and clique-wise marginals.  If we write this as {\boldsymbol \mu}({\boldsymbol \theta}), then we have

\displaystyle \frac{dA}{d{\boldsymbol \theta}} = {\boldsymbol \mu}({\boldsymbol \theta}).

The usual story

Suppose we want to do maximum likelihood learning.  This means we want to set {\boldsymbol \theta} to maximize

L( {\boldsymbol \theta} ) = \frac{1}{N}\sum_{\hat{{\bf x}}}\log p({\bf x};{\boldsymbol \theta})={\boldsymbol \theta}\cdot\frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}})-A({\boldsymbol \theta}).

If we want to use gradient ascent, we would just take a small step along the gradient.  This has a very intuitive form: it is the difference of the expected value of \bf f under the model to the expected value of \bf f under the current distribution.

\displaystyle \frac{dL}{d{\boldsymbol \theta}} = \frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}}) - \sum_{\bf x} p({\bf x};{\boldsymbol \theta}) {\bf f}({\bf x}).

\displaystyle \frac{dL}{d{\boldsymbol \theta}} = \frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}}) - {\boldsymbol \mu}({\boldsymbol \theta}).

Note the lovely property of moment matching here. If we have found a solution, then dL/d{\boldsymbol \theta}=0 and so the expected value of \bf f under the current distribution will be exactly equal to that under the data.

Unfortunately, in a high-treewidth setting, we can’t compute the marginals.  That’s too bad.  However, we have all these lovely approximate inference algorithms (loopy belief propagation, tree-reweighted belief propagation, mean field, etc.).  Suppose we write the resulting approximate marginals as \tilde{{\boldsymbol \mu}}({\boldsymbol \theta}).  Then, instead of taking the above gradient step, why not instead just use

\frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}}) - \tilde{{\boldsymbol \mu}}({\boldsymbol \theta})?

That’s all fine!  However, I often see people say/imply/write some or all of the following:

  1. This is not guaranteed to converge.
  2. There is no longer any well-defined objective function being maximized.
  3. We can’t use line searches.
  4. We have to use (possibly stochastic) gradient ascent.
  5. This whole procedure is frightening and shouldn’t be mentioned in polite company.

I agree that we should view this procedure with some suspicion, but it gets far more than it deserves! The first four points, in my view, are simply wrong.

What’s missing

The critical thing that is missing from the above story is this: Approximate marginals come together with an approximate partition function!

That is, if you are computing approximate marginals \tilde{{\boldsymbol \mu}}({\boldsymbol \theta}) using loopy belief propagation, mean-field, or tree-reweighted belief propagation, there is a well-defined approximate log-partition function \tilde{A}({\boldsymbol \theta}) such that

\displaystyle \tilde{{\boldsymbol \mu}}({\boldsymbol \theta}) = \frac{d\tilde{A}}{d{\boldsymbol \theta}}.

What this means is that you should think, not of approximating the likelihood gradient, but of approximating the likelihood itself. Specifically, what the above is really doing is optimizing the “surrogate likelihood”

\tilde{L}({\boldsymbol \theta}) = {\boldsymbol \theta}\cdot\frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}})-\tilde{A}({\boldsymbol \theta}).

What’s the gradient of this? It is

\frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}}) - \tilde{{\boldsymbol \mu}}({\boldsymbol \theta}),

or exactly the gradient that was being used above. The advantage of doing things this way is that it is a normal optimization.  There is a well-defined objective. It can be plugged into a standard optimization routine, such as BFGS, which will probably be faster than gradient ascent.  Line searches guarantee convergence. \tilde{A} is perfectly tractable to compute. In fact, if you have already computed approximate marginals, \tilde{A} has almost no cost. Life is good.

The only counterargument I can think of is that mean-field and loopy BP can have different local optima, which might mean that a no-line-search-refuse-to-look-at-the-objective-function-just-follow-the-gradient-and-pray style optimization could be more robust, though I’d like to see that argument made…

I’m not sure of the history, but I think part of the reason this procedure has such a bad reputation (even from people that use it!) might be that it predates the “modern” understanding of inference procedures as producing approximate partition functions as well as approximate marginals.

Advertisements