## Personal opinions about graphical models 1: The surrogate likelihood exists and you should use it.

When talking about graphical models with people  (particularly computer vision folks) I find myself advancing a few opinions over and over again.  So, in an effort to stop bothering people at conferences, I thought I’d write a few entries here.

The first thing I’d like to discuss is “surrogate likelihood” training.  (So far as I know, Martin Wainwright was the first person to give a name to this method.)

### Background

Suppose we want to fit a Markov random field (MRF).  I’m writing this as a generative model with an MRF for simplicity– pretty much the same story holds with a Conditional Random Field in the discriminative setting.

$p({\bf x}) = \frac{1}{Z} \prod_{c} \psi({\bf x}_c) \prod_i \psi(y_i)$

Here, the first product is over all cliques/factors in the graph, and the second is over all single variables.  Now, it is convenient to note that MRFs can be seen as members of the exponential family

$p({\bf x};{\boldsymbol \theta}) = \exp( {\boldsymbol \theta} \cdot {\bf f}({\bf x}) - A({\boldsymbol \theta}) )$,

where

${\bf f}({\bf X})=\{I[{\bf X}_{c}={\bf x}_{c}]|\forall c,{\bf x}_{c}\}\cup\{I[X_{i}=x_{i}]|\forall i,x_{i} \}$

is a function consisting of indicator functions for each possible configuration of each clique and variable, and the log-partition function

$A(\boldsymbol{\theta})=\log\sum_{{\bf x}}\exp\boldsymbol{\theta}\cdot{\bf f}({\bf x})$.

ensures normalization.

Now, the log-partition function has the very important (and easy to show) property that the gradient is the expected value of $\bf f$.

$\displaystyle \frac{dA}{d{\boldsymbol \theta}} = \sum_{\bf x} p({\bf x};{\boldsymbol \theta}) {\bf f}({\bf x})$

With a graphical model, what does this mean?  Well, notice that the expected value of, say, $I[X_i=x_i]$ will be exactly $p(x_i;{\boldsymbol \theta})$. Thus, the expected value of ${\bf f}$ will be a vector containing all univariate and clique-wise marginals.  If we write this as ${\boldsymbol \mu}({\boldsymbol \theta})$, then we have

$\displaystyle \frac{dA}{d{\boldsymbol \theta}} = {\boldsymbol \mu}({\boldsymbol \theta})$.

### The usual story

Suppose we want to do maximum likelihood learning.  This means we want to set ${\boldsymbol \theta}$ to maximize

$L( {\boldsymbol \theta} ) = \frac{1}{N}\sum_{\hat{{\bf x}}}\log p({\bf x};{\boldsymbol \theta})={\boldsymbol \theta}\cdot\frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}})-A({\boldsymbol \theta}).$

If we want to use gradient ascent, we would just take a small step along the gradient.  This has a very intuitive form: it is the difference of the expected value of $\bf f$ under the model to the expected value of $\bf f$ under the current distribution.

$\displaystyle \frac{dL}{d{\boldsymbol \theta}} = \frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}}) - \sum_{\bf x} p({\bf x};{\boldsymbol \theta}) {\bf f}({\bf x})$.

$\displaystyle \frac{dL}{d{\boldsymbol \theta}} = \frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}}) - {\boldsymbol \mu}({\boldsymbol \theta})$.

Note the lovely property of moment matching here. If we have found a solution, then $dL/d{\boldsymbol \theta}=0$ and so the expected value of $\bf f$ under the current distribution will be exactly equal to that under the data.

Unfortunately, in a high-treewidth setting, we can’t compute the marginals.  That’s too bad.  However, we have all these lovely approximate inference algorithms (loopy belief propagation, tree-reweighted belief propagation, mean field, etc.).  Suppose we write the resulting approximate marginals as $\tilde{{\boldsymbol \mu}}({\boldsymbol \theta})$.  Then, instead of taking the above gradient step, why not instead just use

$\frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}}) - \tilde{{\boldsymbol \mu}}({\boldsymbol \theta})$?

That’s all fine!  However, I often see people say/imply/write some or all of the following:

1. This is not guaranteed to converge.
2. There is no longer any well-defined objective function being maximized.
3. We can’t use line searches.
4. We have to use (possibly stochastic) gradient ascent.
5. This whole procedure is frightening and shouldn’t be mentioned in polite company.

I agree that we should view this procedure with some suspicion, but it gets far more than it deserves! The first four points, in my view, are simply wrong.

### What’s missing

The critical thing that is missing from the above story is this: Approximate marginals come together with an approximate partition function!

That is, if you are computing approximate marginals $\tilde{{\boldsymbol \mu}}({\boldsymbol \theta})$ using loopy belief propagation, mean-field, or tree-reweighted belief propagation, there is a well-defined approximate log-partition function $\tilde{A}({\boldsymbol \theta})$ such that

$\displaystyle \tilde{{\boldsymbol \mu}}({\boldsymbol \theta}) = \frac{d\tilde{A}}{d{\boldsymbol \theta}}$.

What this means is that you should think, not of approximating the likelihood gradient, but of approximating the likelihood itself. Specifically, what the above is really doing is optimizing the “surrogate likelihood”

$\tilde{L}({\boldsymbol \theta}) = {\boldsymbol \theta}\cdot\frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}})-\tilde{A}({\boldsymbol \theta}).$

What’s the gradient of this? It is

$\frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}}) - \tilde{{\boldsymbol \mu}}({\boldsymbol \theta}),$

or exactly the gradient that was being used above. The advantage of doing things this way is that it is a normal optimization.  There is a well-defined objective. It can be plugged into a standard optimization routine, such as BFGS, which will probably be faster than gradient ascent.  Line searches guarantee convergence. $\tilde{A}$ is perfectly tractable to compute. In fact, if you have already computed approximate marginals, $\tilde{A}$ has almost no cost. Life is good.

The only counterargument I can think of is that mean-field and loopy BP can have different local optima, which might mean that a no-line-search-refuse-to-look-at-the-objective-function-just-follow-the-gradient-and-pray style optimization could be more robust, though I’d like to see that argument made…

I’m not sure of the history, but I think part of the reason this procedure has such a bad reputation (even from people that use it!) might be that it predates the “modern” understanding of inference procedures as producing approximate partition functions as well as approximate marginals.

This entry was posted in Uncategorized and tagged , , . Bookmark the permalink.

### 7 Responses to Personal opinions about graphical models 1: The surrogate likelihood exists and you should use it.

1. Sebastian says:

The problem here is that the approximate marginals, as a function of the model parameters, are defined by the outcome of an inference algorithm. Therefore, by minimizing the Bethe free energy or the mean field energy a small perturbation to the model parameters may as a result lead to convergence to quite different marginal distributions. This happens for example by converging to a different local minima of the respective approximate free energy.

In my view this problem can only be overcome in two ways. The first option is to write the approximate free energy as a function of parameters and marginal distributions. Then we have a joint minimization problem in both and everything you say holds true, that is, most problems simply disappear. The second option is to use a convex free energy approximation as then the marginals as a function of parameters are always uniquely defined and can be found efficiently.

However, storing all marginal distributions during parameter estimation is demanding a lot of memory, and this is one of the reasons why most people do not do this. Instead, they simply run LBP from scratch using the updated parameters. This has all the above mentioned problems because the resulting “min-projected” free energy (projecting out the marginal dimensions by minimizing the function over them, making the resulting function L a function of the parameters only) is no longer continuous.

The above comments are based on extensive experiments and all the above statements can be demonstrated on small toy examples.

S.

2. justindomke says:

I totally agree– if one is using an inference procedure that is susceptible to local minima, the marginals become a non-continuous function of parameters (if you assume inference finds a global minima). The solution I’ve always used is exactly as you say– use a convex free energy like TRW.

If I were to use LBP or mean-field, though, I would probably re-run inference from the start in each iteration, even if memory wasn’t an issue. The reason is that this is at least fitting a well-defined mapping from parameters to marginals (albeit only a piecewise continuous one). If you go over a point in parameter space where the marginals change non-continuously, this is likely to result in a bad surrogate likelihood. As long as you are properly implementing surrogate likelihood using line searches, the line search should punish the marginals being in the region, and pull back to “good” parameter space. That said: 1) People usually only use the surrogate likelihood gradient, meaning once learning enters a “bad” parameter region, the situation is hopeless and 2) Even with proper line searches, learning is still optimizing a non-convex (and not everywhere continuous) function, with all the usual problems that go with that.

3. Sebastian says:

I understand your reasoning, but from a strict technical perspective a line search may not be possible. Let me elaborate.

All implementable line search methods I am aware of assume continuity of the function and a known descent direction at the current iterate. These go hand-in-hand: if your function is not continuous your computed descent direction, for example the gradient of your negative surrogate log-likelihood, may fail to provide any descent for any positive step size because the function is discontinuous at your current iterate and has larger function values for all positive step sizes.

This is not mere theoretical, but akin to using a continuous differentiable optimization code to optimize a non-differentiable function. For any given random starting point there are no problems in the first iteration as your set of non-differentiabilities has measure zero, but during optimization you will often converge to non-differentiable points.

So what happens if you put a surrogate log-likelihood into a standard optimization code such as L-BFGS?
You observe problems in the line search phase because basic assumptions (continuity, differentiability) are violated. In my experience the problem is worst for naive mean field approximations because the approximation is extreme. Another assumption, depending on the initialization and message scheduling, is that your inference results should be deterministic (for example, you may want to warm start mean field inference when the parameters are updated, but you better not because you may obtain two different function values on different occasions for the same parameter, and this violates another assumption optimization codes make).

From my point of view the points 1., 3., (and 5. 🙂 mentioned in the original article are real. I agree that using surrogate likelihood functions is a good idea, but you cannot just plug them into existing optimization codes and expect things to work.

4. justindomke says:

I agree that: 1) With a non-convex entropy, using warm-start on the marginals might not be a good idea, since then inference isn’t a deterministic mapping. I don’t like doing learning as a joint optimization of parameters and all marginals for a similar reason. 2) Using a line search on a piecewise continuous function such a LBP-based surrogate likelihood might stop at a non-differentiable point rather than a full local minimum.

What I was addressing in the original post is a mis-perception that there is no well-defined objective function being fit. If you are going to fit the surrogate likelihood, you are better off actually computing your objective function and using line searches.

I’m interested in your observation that learning often converges to a non-differentiable point. Have you seen this in practice with LBP? More precisely, have you found LBP to fail to achieve a pseudo-moment match (generative setting) or “conditional pseudo moment match” (conditional setting)? There have been many papers published using LBP surrogate likelihood that got reasonable results, though I guess publication bias could be an issue.

As for the original 5 points, it certainly is guaranteed to converge (#1) albeit possible to a non-differentiable point rather than a local minimum, and you surely can use line searches (#3) just without the guarantee of convergence to a local minimum. I admit that the loss of that guarantee might cause #5 to be true for certain people.

Anyway, I guess the take-home message is to use a convex entropy, since none of these issues exist there!

5. Sebastian says:

I guess we both agree that a very useful practical estimator can be derived from convex and non-convex entropy approximations. My main point in the previous post is that you cannot just use off-the-shelf optimization codes (L-BFGS, minimize.m, etc.) and expect it to work. As you explain, it can work if the inference is deterministic and you use a custom line-search that works when the usual assumptions fail (continuity, differentiability).

Regarding experiments, I put some small results online at http://www.nowozin.net/sebastian/tmp/ll/

What is shown there is a simple 3-by-3 generative model with just two parameters, and the model is linear in these parameters. I obtain 100 exact samples from the model and then perform maximum likelihood estimation. Every PDF file visualizes the negative surrogate log-likelihood surface as seen by an optimization code. BFINF is the true log-likelihood (convex), NMF and SMF are naive and structured mean field approximations (NMF is non-differentiable because of convergence to different minima, SMF seems fine), BPINF the Bethe log-likelihood (non-convex, but well behaved; but the presence of the second minima is interesting), MPLE and MCLE are pseudo- and composite likelihood approximations (convex). The generating parameters for 100 samples is shown with a red cross, the maximum surrogate likelihood estimate with a blue circle.

I have not seen the naive mean field inference fail on real instances, but did not use LBP extensively. For the cases where I used it, it seemed to work fine.

Sebastian

6. Sebastian says:

The last sentence should read “I have not seen loopy belief propagation fail on real instances, …”

7. justindomke says:

Great visualizations, thanks!