Statistics – the rules of the game

What is statistics about, really? It’s easy to go through a class and get the impression that it’s about manipulating intimidating formulas. But what’s the goal of them? Why did people invent them?

If you zoom out, the big picture is more conceptual than mathematical. Statistics has a crazy, grasping ambition: it wants to tell you how to best use observations to make decisions. For example, you might look at how much it rained each day in the last week, and decide if you should bring an umbrella today. Statistics converts data into ideal actions.

Here, I’ll try to explain this view. I think it’s possible to be quite precise about this while using almost no statistics background and extremely minimal math.

The two important characters that we meet are decision rules and loss functions. Informally, a decision rule is just some procedure that looks at a dataset and makes a choice. A loss function — a basic concept from decision theory– is a precise description of “how bad” a given choice is.

Model Problem: Coinflips

Let’s say you’re confronted with a coin where the odds of heads and tails are not known ahead of time. Still, you are allowed to observe how the coin performs over a number of flips. After that, you’ll need to make a “decision” about the coin. Explicitly:

  • You’ve got a coin, which comes up heads with probability w. You don’t know w.
  • You flip the coin n times.
  • You see k heads and n-k tails.
  • You do something, depending on k. (We’ll come back to this.)

Simple enough, right? Remember, k is the total number of heads after n flips. If you do some math, you can work out a formula for p(k\vert w,n): the probability of seeing exactly k heads. For our purposes, it doesn’t really matter what that formula is, just that it exists. It’s known as a Binomial distribution, and so is sometimes written \mathrm{Binomial}(k\vert n,w).
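If you want to poke at this numerically, scipy ships the Binomial pmf; here's a minimal sketch (mine, not part of the original argument) that prints p(k\vert w,n) for the example below:

```python
# Minimal numerical sketch of p(k | w, n) using scipy's Binomial pmf.
from scipy.stats import binom

n, w = 21, 0.5                        # the example used in the figures below
for k in range(n + 1):
    print(k, binom.pmf(k, n, w))      # probability of exactly k heads in n flips
```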

Here’s an example of what this looks like with n=21 and w=0.5.

[Figure: the distribution of k for n=21, w=0.5]

Naturally enough, if w=.5, with 21 flips, you tend to see around 10-11 heads. Here’s an example with w=0.2. Here, the most common value is 4, close to 21\times.2=4.2.

[Figure: the distribution of k for n=21, w=0.2]

Decisions, decisions

After observing some coin flips, what do we do next? You can imagine facing various possible situations, but we will use the following:

Our situation: After observing n coin flips, you need to guess “heads” or “tails”, for one final coin flip.

Here, you just need to “decide” what the next flip will be. You could face many other decisions, e.g. guessing the true value of w.

Now, suppose that you have a friend who seems very skilled at predicting the final coinflip. What information would you need to reproduce your friend’s skill? All you need to know is if your friend predicts heads or tails for each possible value of k. We think of this as a decision rule, which we write abstractly as

\mathrm{Dec}(k).

This is just a function of one integer k. You can think of this as just a list of what guess to make, for each possible observation, for example:

  k        \mathrm{Dec}(k)
  0        \mathrm{tails}
  1        \mathrm{heads}
  2        \mathrm{heads}
  \vdots   \vdots
  n        \mathrm{tails}

One simple decision rule would be to just predict heads if you saw more heads than tails, i.e. to use

\mathrm{Dec}(k)=\begin{cases} \mathrm{heads}, & k\geq n/2 \\ \mathrm{tails} & k<n/2 \end{cases}.

The goal of statistics is to find the best decision rule, or at least a good one. The rule above is intuitive, but not necessarily the best. And… wait a second… what does it even mean for one decision rule to be “better” than another?

Our goal: minimize the thing that’s designed to be minimized

What happens after you make a prediction? Consider our running example. There are many possibilities, but here are two of the simplest:

  • Loss A: If you predicted wrong, you lose a dollar. If you predicted correctly, nothing happens.
  • Loss B: If you predict “tails” and “heads” comes up, you lose 10 dollars. If you predict “heads” and “tails” comes up, you lose 1 dollar. If you predict correctly, nothing happens.

We abstract these through a concept of a loss function. We write this as

L(w,d).

The first input is the true (unknown) value w, while the second input is the “prediction” you made. We want the loss to be small.

Now, one point might be confusing. We defined our situation as predicting the next coinflip, but now L is defined comparing d to w, not to a new coinflip. We do this because comparing to w gives the most generality. To deal with our situation, just use the average amount of money you’d lose if the true value of the coin were w. Take loss A. If you predict “tails”, you’ll be wrong with probability w, and so lose w dollars on average; if you predict “heads”, you’ll be wrong with probability 1-w, and so lose 1-w dollars on average. This leads to the loss

L_{A}(w,d)=\begin{cases} w & d=\mathrm{tails}\\ 1-w & d=\mathrm{heads} \end{cases}.

For loss B, the situation is slightly different, in that you lose 10 times as much in the first case. Thus, the loss is

L_{B}(w,d)=\begin{cases} 10w & d=\mathrm{tails}\\ 1-w & d=\mathrm{heads} \end{cases}.
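Written as code, these are just tiny functions of w and d; here is a minimal sketch (the names loss_A and loss_B are my own, and later snippets reuse them):

```python
# The two example losses as plain functions of the true w and the decision d.

def loss_A(w, d):
    # a wrong prediction costs 1 dollar, so the expected cost is w or 1 - w
    return w if d == "tails" else 1 - w

def loss_B(w, d):
    # predicting "tails" when the flip comes up heads costs 10x as much
    return 10 * w if d == "tails" else 1 - w
```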

The definition of a loss function might feel circular– we minimize the loss because we defined the loss as the thing that we want to minimize. What’s going on? Well, a statistical problem has two separate parts: a model of the data generating process, and a loss function describing your goals. Neither of these things is determined by the other.

So, the loss function is part of the problem. Statistics wants to give you what you want. But you need to tell statistics what that is.

Despite the name, a “loss” can be negative– you still just want to minimize it. Machine learning, always optimistic, favors “reward” functions that are to be maximized. Plus ça change.

Model + Loss = Risk

OK! So, we’ve got a model of our data generating process, and we specified some loss function. For a given w, we know the distribution over k, so… I guess… we want to minimize it?

Let’s define the risk to be the average loss that a decision rule gives for a particular value of w. That is,

R(w,\mathrm{Dec})=\sum_{k}p(k\vert w,n)L(w,\mathrm{Dec}(k)).

Here, the second input to R is a decision rule– a precise recipe of what decision to make in each possible situation.
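As a sketch, computing the risk is a one-line sum over k, reusing the loss helpers above and treating the decision rule as a plain function of k:

```python
# Risk R(w, Dec) = sum_k p(k | w, n) * L(w, Dec(k)).
from scipy.stats import binom

def risk(w, dec, loss, n=21):
    # dec: maps the observed count k to "heads" or "tails"
    # loss: one of the loss functions sketched above
    return sum(binom.pmf(k, n, w) * loss(w, dec(k)) for k in range(n + 1))
```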

Let’s visualize this. As a set of possible decision rules, I will just consider rules that predict “heads” if they’ve seen at least m heads, and “tails” otherwise:

\mathrm{Dec}_{m}(k)=\begin{cases} \mathrm{heads} & k\geq m\\ \mathrm{tails} & k<m \end{cases}.

With n=21 there are 23 such decision rules, corresponding to m=0 (always predict heads), m=1 (predict heads if you see at least one head), up to m=22 (always predict tails). These are shown here:

[Figure: the decision rules \mathrm{Dec}_m for m=0 through m=22]
These rules are intuitive: if you’d predict heads after observing 16 heads out of 21, it would be odd to predict tails after seeing 17 instead! It’s true that for losses L_{A} and L_{B}, you don’t lose anything by restricting to this kind of decision rule. However, there are losses for which these decision rules are not enough. (Imagine you lose more when your guess is correct.)
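As a quick sketch, the whole family \mathrm{Dec}_m fits in a line of Python, and sweeping it with the risk helper above reproduces the kind of curves plotted below:

```python
# The threshold rules Dec_m, and their risk at a fixed w (here w = 0.2, loss A).
n = 21

def dec_m(m):
    return lambda k: "heads" if k >= m else "tails"

for m in range(n + 2):                          # m = 0, 1, ..., 22
    print(m, risk(0.2, dec_m(m), loss_A, n))
```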

With those decision rules in place, we can visualize what risk looks like. Here, I fix w=0.2, and I sweep through all the decision rules (by changing m) with loss L_{A}:

[Figure: sweeping over the decision rules \mathrm{Dec}_m at w=0.2 with loss L_A]

The value R_A in the bottom plot is the total area of the green bars in the middle. You can do the same sweep for w=0.4, which is pictured here:

[Figure: sweeping over the decision rules \mathrm{Dec}_m at w=0.4 with loss L_A]

We can visualize the risk in one figure with various w and m. Notice that the curves for w=0.2 and w=0.4 are exactly the same as we saw above.

[Figure: the risk R(w,\mathrm{Dec}_m) as a function of w, for each m, under loss L_A]

Of course, we get a different risk depending on what loss function we use. If we repeat the whole process using loss L_{B} we get the following:

[Figure: the risk R(w,\mathrm{Dec}_m) as a function of w, for each m, under loss L_B]

Dealing with risk

What’s the point of risk? It tells us how good a decision rule is. We want a decision rule where risk is as low as possible. So you might ask, why not just choose the decision rule that minimizes R(w,\mathrm{Dec})?

The answer is: because we don’t know w! How do we deal with that? Believe it or not, there isn’t a single well-agreed upon “right” thing to do, and so we meet two different schools of thought.

Option 1 : All probability all the time

Bayesian statistics (don’t ask about the name) defines a “prior” distribution p(w) over w. This says which values of w we think are more and less likely. Then, we define the Bayesian risk as the average of R over the prior:

R_{\mathrm{Bayes}}(\mathrm{Dec})=\int_{w=0}^{1}p(w)R(w,\mathrm{Dec})dw.

This just amounts to “averaging” over all the risk curves, weighted by how “probable” we think w is. Here’s the Bayes risk corresponding to L_{A} with a uniform prior p(w)=1:

[Figure: the Bayes risk of each \mathrm{Dec}_m under loss L_A]

For reference, the risk curves R(w,\mathrm{Dec}_m) are shown in light grey. Naturally enough, for each value of m, the Bayes risk is just the average of the regular risks for each w.

Here’s the risk corresponding to L_{B}:

[Figure: the Bayes risk of each \mathrm{Dec}_m under loss L_B]
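Numerically, the integral over the uniform prior can be approximated by an average over a grid of w values; a rough sketch, reusing the earlier helpers:

```python
# Bayes risk under a uniform prior, approximated by a grid average over w.
import numpy as np

def bayes_risk(dec, loss, n=21, grid=500):
    ws = np.linspace(0.0, 1.0, grid)
    return np.mean([risk(w, dec, loss, n) for w in ws])

# e.g. the Bayes risk of each threshold rule under loss A:
for m in range(23):
    print(m, bayes_risk(dec_m(m), loss_A))
```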

That’s all quite natural. But we haven’t really searched through all the decision rules, only the simple ones \mathrm{Dec}_m. For other losses, these simple ones might not be enough, and there are a lot of decision rules. (Even for this toy problem there are 2^{22}, since you can output heads or tails for each of k=0, k=1, …, k=21.)

Fortunately, we can get a formula for the best decision rule for any loss. First, re-write the Bayes risk as

R_{\mathrm{Bayes}}(\mathrm{Dec})=\sum_{k} \left( \int_{w=0}^{1}p(w)p(k\vert n,w)L(w,\mathrm{Dec}(k))dw \right).

This is a sum over k where each term only depends on a single value \mathrm{Dec}(k). So, we just need to make the best decision for each individual value of k separately. This leads to the Bayes-optimal decision rule of

\mathrm{Dec}_{\text{Bayes}}(k)=\arg\min_{d}\int_{w=0}^{1}p(w)p(k\vert w,n)L(w,d)dw.
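Here's a rough numerical sketch of that argmin, with the integral replaced by a grid average and a uniform prior p(w)=1 (matching the figures below):

```python
# Bayes-optimal decision for each k under a uniform prior, by grid integration.
import numpy as np
from scipy.stats import binom

def dec_bayes(k, loss, n=21, grid=500):
    ws = np.linspace(0.0, 1.0, grid)
    def expected_loss(d):
        return np.mean([binom.pmf(k, n, w) * loss(w, d) for w in ws])
    return min(["heads", "tails"], key=expected_loss)

print([dec_bayes(k, loss_B) for k in range(22)])   # hesitant to predict "tails"
```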

With a uniform prior p(w)=1, here’s the optimal Bayesian decision rule with loss L_{A}:

[Figure: the Bayes-optimal decision rule for loss L_A]

And here it is for loss L_B:

[Figure: the Bayes-optimal decision rule for loss L_B]

Look at that! Just mechanically plugging the loss function into the Bayes-optimal decision rule naturally gives us the behavior we expected– for L_{B}, the rule is very hesitant to predict tails, since the loss is so high if you’re wrong. (Again, these happen to fit in the parameterized family \mathrm{Dec}_{m} defined above, but we didn’t use this assumption in deriving the rules.)

The nice thing about the Bayesian approach is that it’s so systematic. No creativity or cleverness is required. If you specify the data generating process (p(k\vert w,n)), the loss function (L(w,d)), and the prior distribution (p(w)), then the optimal Bayesian decision rule is determined.

There are some disadvantages as well:

  • Firstly, you need to make up the prior, and if you do a terrible job, you’ll get a poor decision rule. If you have little prior knowledge, this can feel incredibly arbitrary. (Imagine you’re trying to estimate Big G.) Different people can have different priors, and then get different results.
  • Actually computing the decision rule requires doing an integral over w, which can be tricky in practice.
  • Even if your prior is good, the decision rule is only optimal when averaged over the prior. Suppose, for every day for the next 10,000 years, a random coin is created with w drawn from p(w). Then, no decision rule will incur less loss than \mathrm{Dec}_{\text{Bayes}}. However, on any particular day, some other decision rule could certainly be better.

So, if you have little idea of your prior, and/or you’re only making a single decision, you might not find much comfort in the Bayesian guarantee.

Some argue that these aren’t really disadvantages. Prediction is impossible without some assumptions, and priors are upfront and explicit. And no method can be optimal for every single day. If you just can’t handle that the risk isn’t optimal for each individual trial, then… maybe go for a walk or something?

Option 2 : Be pessimistic

Frequentist statistics (Why “frequentist”? Don’t think about it!) often takes a different path. Instead of defining a prior over w, let’s take a worst-case view. Let’s define the worst-case risk as

R_{\mathrm{worst}}(\mathrm{Dec})=\max_{w}R(w,\mathrm{Dec}).

Then, we’d like to choose an estimator to minimize the worst-case risk. We call this a “minimax” estimator since we minimize the max (worst-case) risk.
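Numerically, the max over w can again be approximated on a grid; a small sketch reusing the risk helper from before:

```python
# Worst-case risk: the maximum of R(w, Dec) over a grid of w values.
import numpy as np

def worst_risk(dec, loss, n=21, grid=500):
    return max(risk(w, dec, loss, n) for w in np.linspace(0.0, 1.0, grid))
```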

Let’s visualize this with our running example and L_{A}:

[Figure: the worst-case risk of each \mathrm{Dec}_m under loss L_A]

As you can see, for each individual decision rule, we search over the space of parameters w to find the worst case. We can visualize the worst-case risk with L_{B} as:

[Figure: the worst-case risk of each \mathrm{Dec}_m under loss L_B]

What’s the corresponding minimax decision rule? This is a little tricky to deal with– to see why, let’s expand the worst-case risk a bit more:

R_{\mathrm{Worst}}(\mathrm{Dec})=\max_{w}\sum_{k}p(k\vert n,w)L(w,\mathrm{Dec}(k)).

Unfortunately, we can’t interchange the max and the sum, like we did with the integral and the sum for Bayesian decision rules. This makes it more difficult to write down a closed-form solution. At least in this case, we can still find the best decision rule by searching over our simple rules \mathrm{Dec}_m. But be very mindful that this doesn’t work in general!
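For this toy problem, a brute-force search over the threshold family is enough; a sketch reusing the helpers above (remember that restricting to \mathrm{Dec}_m is what makes this valid here):

```python
# Brute-force minimax over the threshold family Dec_m.
best_m = min(range(23), key=lambda m: worst_risk(dec_m(m), loss_B))
print("minimax threshold for loss B:", best_m)
```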

For L_{A} we end up with the same decision rule as when minimizing Bayesian risk:

[Figure: the minimax decision rule for loss L_A]

For L_{B}, meanwhile, we get something slightly different:

[Figure: the minimax decision rule for loss L_B]

This is even more conservative than the Bayesian decision rule. \mathrm{Dec}_{B-\mathrm{Bayes}}(1)=\mathrm{tails}, while \mathrm{Dec}_{B-\mathrm{minimax}}(1)=\mathrm{heads}. That is, the Bayesian method predicts heads only when it observes 2 or more heads, while the minimax rule predicts heads if it observes even one. This makes sense intuitively: The minimax decision rule proceeds as if the “worst” w (a small number) is fixed, whereas the Bayesian decision rule less pessimistically averages over all w.

Which decision rule will work better? Well, if w happens to be near the worst-case value, the minimax rule will be better. If you repeat the whole experiment many times with w drawn from the prior, the Bayesian decision rule will be.

If you do the experiment at some w far from the worst-case value, or you repeat the experiment many times with w drawn from a distribution different from your prior, then you have no guarantees.

Neither approach is “better” than the other, they just provide different guarantees. You need to choose what guarantee you want. (You can kind of think of this as a “meta” loss.)

So what about all those formulas, then?

For real problems, the data generating process is usually much more complex than a Binomial. The “decision” is usually more complex than predicting a coinflip– the most common decision is making a guess for the value of w. Even calculating R(w,\mathrm{Dec}) for fixed w and \mathrm{Dec} is often computationally hard, since you need to integrate over all possible observations. In general, finding exact Bayes or minimax optimal decision rules is a huge computational challenge, and at least some degree of approximation is required. That’s the game, that’s why statistics is hard. Still, even for complex situations the rules are the same– you win by finding a decision rule with low risk.


A Divergence Bound For Hybrids of MCMC and Variational Inference and …

At ICML I recently published a paper that I somehow decided to title “A Divergence Bound for Hybrids of MCMC and Variational Inference and an Application to Langevin Dynamics and SGVI”. This paper gives one framework for building “hybrid” algorithms between Markov chain Monte Carlo (MCMC) and Variational inference (VI) algorithms. Then, it gives an example for particular algorithms, namely:

  • MCMC ⇔ Stochastic Gradient Langevin Dynamics [1] [2] [3]
  • VI ⇔ Stochastic Gradient VI [4] [5] [6]


Is there a pleasing visualization?

Here are three different views of the algorithm for a one-dimensional problem, interpolating between VI-like algorithms and MCMC-like algorithms as β goes from 0 (VI) to 1 (MCMC).

[Figure: three views of the algorithm on a one-dimensional problem as β goes from 0 to 1]

(Admittedly, this might not make that much sense at this point.)

What is VI?

The goal of “inference” is to be able to evaluate expectations with respect to some “target” distribution p(z). Variational inference (VI) converts this problem into the minimization of the KL-divergence KL(q_w(z) \Vert p(z)) for some simple class of distributions q_w(z) parameterized by w. For example, if p is a mixture of two Gaussians (in red), and q is a single Gaussian (in blue), the VI optimization in one dimension would arrive at the solution below.

[Figure: a single Gaussian q (blue) fit by VI to a mixture of two Gaussians p (red)]
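To make this concrete, here is a toy sketch of my own (not the paper's code): fitting a single 1-D Gaussian to a two-Gaussian mixture by minimizing a fixed-sample Monte Carlo estimate of the KL-divergence:

```python
# Toy VI: fit a 1-D Gaussian q to a two-Gaussian mixture p by minimizing a
# Monte Carlo estimate of KL(q || p) with fixed base samples (reparameterization).
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def log_p(z):  # the "target": an equal mixture of N(-2, 0.7^2) and N(2, 0.7^2)
    return np.logaddexp(np.log(0.5) + norm.logpdf(z, -2.0, 0.7),
                        np.log(0.5) + norm.logpdf(z, 2.0, 0.7))

eps = np.random.randn(2000)               # fixed standard-normal draws

def kl_estimate(params):
    mu, log_sigma = params
    z = mu + np.exp(log_sigma) * eps      # samples from q_w, with w = (mu, log_sigma)
    return np.mean(norm.logpdf(z, mu, np.exp(log_sigma)) - log_p(z))

w_hat = minimize(kl_estimate, x0=[0.5, 0.0], method="Nelder-Mead").x
print("fitted mean and log-stddev:", w_hat)
```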

What is MCMC?

Given that same target distribution p(z), Markov chain Monte Carlo creates a random walk over z. The random walk is carefully constructed so that if you run it a long time, the probability it will end up in any given state is proportional to p(z). You can picture it as follows.

[Figure: MCMC samples from the same target distribution p]
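And here is a correspondingly minimal MCMC sketch, a plain random-walk Metropolis sampler for the same log_p defined above (the paper's experiments use Langevin dynamics, described below, but the long-run idea is the same):

```python
# Toy MCMC: random-walk Metropolis targeting the same log_p as above.
import numpy as np

def metropolis(log_p, n_steps=20000, step=0.8):
    z, samples = 0.0, []
    for _ in range(n_steps):
        proposal = z + step * np.random.randn()
        # accept with probability min(1, p(proposal) / p(z))
        if np.log(np.random.rand()) < log_p(proposal) - log_p(z):
            z = proposal
        samples.append(z)
    return np.array(samples)

samples = metropolis(log_p)
```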

Why would you want to interpolate between them?

In short, VI is only an approximate algorithm, but MCMC can be very slow. In practice, the difference can be enormous– MCMC may require many orders of magnitude more time just to equal the performance of VI. This presents a user with an awkward situation: if one chooses the best algorithm for each time horizon, there's a "gap" between the point where VI has converged and the point where MCMC overtakes it. Informally, you can get performance that looks like this:

[Figure: cartoon of error versus time for VI and MCMC]

Intuitively, it seems like something better should be achievable at those intermediate times.

But they are so different. Is it even possible to combine them?

Very roughly speaking, you can define a random walk over the space of variational distributions. Then, you trade off between how random the walk is and how random the distributions are. You arrive at something like this:

[Figure: samples from the hybrid algorithm at an intermediate β]

Put another way, both VI and MCMC seek “high probability” regions in z, but with different coverage strategies:

  • VI explicitly includes the entropy in its objective
  • MCMC injects randomness into its walk

It is therefore natural to define a random walk over w, where we trade off between “how random” the walk is and “how much” high entropy w are favored.

That’s fine intuition, but can you guarantee anything formally?

Yes! Or, at least, sort of. To define a bit of notation, we start with a fixed variational family q(z|w) and a target distribution p(z). Now, we want to define a distribution q(w) (so we can do a random walk) so that

q(z) = \int_w q(w) q(z|w) \approx p(z).

The natural goal would be to minimize the KL-divergence KL(q(Z) \Vert p(Z))=\int_z q(z) \log(q(z)/p(z)). That’s difficult since q(z) is defined by marginalizing q(w) out– you can’t evaluate it. What you can do is set up two upper-bounds on this quantity.

The first bound is the conditional divergence:

KL(q(Z) \Vert p(Z)) \leq D_0 := \int_w q(w) \int_z q(z|w) \log \frac{q(z|w)}{p(z)}

The second bound is the joint divergence. You need to augment p(z) with some distribution p(w|z) and then you have the bound

KL(q(Z) \Vert p(Z)) \leq D_1 := \int_w q(w) \int_z q(z|w) \log \frac{q(w)q(z|w)}{p(z)p(w|z)}

Since these are both upper-bounds, a convex combination of them will also be. Thus, the goal is to find the distribution q(w) that minimizes D_\beta = (1-\beta)D_0 + \beta D_1, for any \beta in the [0,1] interval.
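Spelled out, the convex-combination step is just

KL(q(Z) \Vert p(Z)) = (1-\beta) KL(q(Z) \Vert p(Z)) + \beta KL(q(Z) \Vert p(Z)) \leq (1-\beta) D_0 + \beta D_1 = D_\beta.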

What distribution optimizes that bound?

First, note that D_\beta depends on the choice of p(w|z). You get a valid upper-bound for any choice, but the tightness changes. The paper uses p(w|z) = r(w) q(z|w) / r_z where r_z = \int_w r(w) q(z|w) is a normalizing constant. Here, you can think of r(w) as something akin to a “base measure”. r_z is restricted to be constant over z. (This isn’t a terrible restriction– it essentially means that if r(w) were a prior for q(z|w), it wouldn’t favor any point.)

Taking that choice of p(w|z) the solution turns out to be:

q^*(w) = \exp( s(w) - A)
s(w) = \log r(w) - \log r_z + E_{q(Z|w)} [\beta^{-1} \log p(Z) + (1-\beta^{-1}) \log q(Z|w)]
A = \log \int_w \exp s(w)

Furthermore, the actual value of the divergence bound at the solution turns out to be just the normalizing constant A, scaled by -\beta:

D^*_\beta = - \beta A.

How would you apply this to real algorithms?

To do anything concrete, you need to look at a specific VI algorithm and a specific MCMC algorithm. The paper uses

  • Langevin dynamics as an MCMC algorithm, which iterates the updates

    z \leftarrow z + \frac{\epsilon}{2} \nabla_z \log p(z) + \sqrt{\epsilon} \eta

    where \eta is noise from a standard Gaussian and \epsilon is a step-size.

  • Stochastic gradient VI which uses the iteration

    w \leftarrow w - \frac{\epsilon}{2} \nabla_w KL(q(Z|w) \Vert p(Z))

To get the novel algorithm in this paper, all that really needs to be done is to apply Langevin dynamics to the distribution q^*(w) derived above. Then, after a re-scaling, this becomes the new hybrid algorithm

w \leftarrow w - \frac{\epsilon}{2} \nabla_w \Big( KL(q(Z|w) \Vert p(Z)) + \beta H(w) - \beta \log r_\beta(w) \Big) + \sqrt{\beta \epsilon} \eta.

Here, H is the entropy of q_w. This clearly becomes the previous VI algorithm in the limit of \beta \rightarrow 0. It also essentially becomes Langevin with \beta \rightarrow 1. That’s because the distribution r_\beta (not yet defined!) will prefer w where q(Z|w) is highly concentrated. Thus, only the mean parameters of w matter, and sampling w becomes equivalent to just sampling z.

I feel like you’re skipping some details here.

Yes. First, the experiments below use a diagonal Gaussian for q(z|w) with w=(\mu, \nu) and \nu_i = \log_{10} \sigma_i. Second, the gradient of the objective involves a KL-divergence. Exactly computing this is intractable, but it can be approximated with standard tricks from stochastic VI, namely data subsampling and the “reparameterization trick”. Third, r_\beta needs to be chosen. The experiments below use the (improper) distribution r_\beta(w) \propto \prod_i \mathcal{N}(\nu_i \vert u_\beta,1) where u_\beta is a universal constant chosen for each \beta to minimize the divergence bound when p(z) is a standard Gaussian. (If you, like me, find this displeasing, see below.)

Are there some more pictures of samples?

Here are a couple of 2-D examples, sampling from a “doughnut” distribution and a “three peaks” mixture distribution. Here, the samples are visualized by putting a curve at one standard deviation around the mean. Notice that the samples smoothly become more “MCMC-like” as \beta increases.

[Figure: hybrid samples from the “three peaks” mixture distribution as β increases]

[Figure: hybrid samples from the “doughnut” distribution as β increases]

What about “real” problems? Can you show samples there?

Sure, but of course in more than 2 dimensions it’s hard to show samples. Here are some results sampling from a logistic regression model on the classic ionosphere dataset. As a comparison, I implemented the same model with STAN and ran it for a huge amount of time to generate “presumed correct” samples. I then projected all samples onto the first two principal components.

[Figure: posterior samples for logistic regression on the ionosphere dataset, projected onto the first two principal components]

(Note: technically what’s shown here is a sample z being drawn from each sampled w)

The top row shows the results after 10^4 iterations, the middle row after 10^5, and the bottom row after 10^6. You can roughly see that for small time horizons you are better off using a lower value of \beta, but for higher time horizons you should use a larger value.

Do you get the desired speed/accuracy tradeoff?

Here, you need to compare the error each value of \beta creates at each time horizon. This is made difficult by the fact that you also need to select a step-size \epsilon and the best step-size changes depending on the time and \beta. To be as fair as possible, I ran 100 experiments with a range of step-sizes, and averaged the performance. Then, for each value of \beta and each time horizon, the results are shown with the best timestep. (Actually, this same procedure was used to generate the previous plots of samples as well.)

[Figure: MMD error versus time for each value of β on the logistic regression model]

The above plot shows the error (measured in MMD) on the y axis against time on the x-axis. Note that both axes are logarithmic. There are often several orders of magnitude of time horizons where an intermediate algorithm performs better than pure VI (β=0) or pure MCMC (β=1).

Is there anything that remains unsatisfying about this approach?

The most unsatisfying thing about this approach is the need to choose p(w|z). This is a bit disturbing, since this is not an object that “exists” in either the pure MCMC or pure VI worlds. On the other hand, there is a strong argument that it needs to exist here. If you carefully observe q^*(w) above, you’ll notice that it depends on the particular parameterization of w. So, e.g. if we “stretched out” part of the space of w this would change the marginal density q(z). That would be truly disturbing, but if p(w|z) is transformed in the opposite way, it would counteract that. So, p(w|z) needs to exist to reflect how we’ve parameterized w.

On the other hand, simply picking a single static distribution r_\beta(w) is pretty simplistic. (Recall, p(w|z) was defined in terms of r(w) above) It would be natural to try to adjust this distribution during inference to tighten the bound D^*_\beta. Using the fact that D^*_\beta=-\beta A you can show that it’s possible to find derivatives of D^*_\beta with respect to the parameters of r online, and thus tighten the bound while the algorithm is running. (I didn’t want to do this in this paper since neither VI nor MCMC do this, and it complicates the interpretation of the experiments.)

Finally, the main question is if this can be extended to other pairs of VI / MCMC algorithms. I actually first derived this algorithm by looking at simple discrete graphical models, e.g. Ising models. There, you can use the algorithms:

  • VI: Use a fully-factorized variational distribution and single-site coordinate ascent updates
  • MCMC: Use single-site Gibbs sampling.

You do in fact get a useful hybrid algorithm in the middle. However, the unfortunate reality is that both of the endpoints are considered pretty bad algorithms, so it’s hard to get too excited about the interpolation.

Finally, do note that there are other ideas out there for combining MCMC and VI. However, these usually fall into the camps of “putting MCMC inside of VI” [7] [8] [9] or “putting VI inside of MCMC” [10], rather than a straight interpolation of the two.

Personal opinions about graphical models 1: The surrogate likelihood exists and you should use it.

When talking about graphical models with people  (particularly computer vision folks) I find myself advancing a few opinions over and over again.  So, in an effort to stop bothering people at conferences, I thought I’d write a few entries here.

The first thing I’d like to discuss is “surrogate likelihood” training.  (So far as I know, Martin Wainwright was the first person to give a name to this method.)

Background

Suppose we want to fit a Markov random field (MRF).  I’m writing this as a generative model with an MRF for simplicity– pretty much the same story holds with a Conditional Random Field in the discriminative setting.

p({\bf x}) = \frac{1}{Z} \prod_{c} \psi({\bf x}_c) \prod_i \psi(x_i)

Here, the first product is over all cliques/factors in the graph, and the second is over all single variables.  Now, it is convenient to note that MRFs can be seen as members of the exponential family

p({\bf x};{\boldsymbol \theta}) = \exp( {\boldsymbol \theta} \cdot {\bf f}({\bf x}) - A({\boldsymbol \theta}) ),

where

{\bf f}({\bf X})=\{I[{\bf X}_{c}={\bf x}_{c}]|\forall c,{\bf x}_{c}\}\cup\{I[X_{i}=x_{i}]|\forall i,x_{i} \}

is a function consisting of indicator functions for each possible configuration of each clique and variable, and the log-partition function

A(\boldsymbol{\theta})=\log\sum_{{\bf x}}\exp\boldsymbol{\theta}\cdot{\bf f}({\bf x})

ensures normalization.

Now, the log-partition function has the very important (and easy to show) property that the gradient is the expected value of \bf f.

\displaystyle \frac{dA}{d{\boldsymbol \theta}} = \sum_{\bf x} p({\bf x};{\boldsymbol \theta}) {\bf f}({\bf x})

With a graphical model, what does this mean?  Well, notice that the expected value of, say, I[X_i=x_i] will be exactly p(x_i;{\boldsymbol \theta}). Thus, the expected value of {\bf f} will be a vector containing all univariate and clique-wise marginals.  If we write this as {\boldsymbol \mu}({\boldsymbol \theta}), then we have

\displaystyle \frac{dA}{d{\boldsymbol \theta}} = {\boldsymbol \mu}({\boldsymbol \theta}).
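If you want to sanity-check this identity, here's a small brute-force sketch (my own toy example, a three-variable binary chain) where A({\boldsymbol \theta}) and the marginals are computed by enumeration, and a finite-difference gradient of A matches {\boldsymbol \mu}({\boldsymbol \theta}):

```python
# Brute-force check that dA/dtheta equals the marginals, on a tiny binary chain.
import itertools
import numpy as np

EDGES = [(0, 1), (1, 2)]                      # a 3-variable chain

def features(x):
    f = []
    for (i, j) in EDGES:                      # one indicator per edge configuration
        f += [float((x[i], x[j]) == c) for c in itertools.product((0, 1), repeat=2)]
    for i in range(3):                        # one indicator per single-variable state
        f += [float(x[i] == s) for s in (0, 1)]
    return np.array(f)

F = np.array([features(x) for x in itertools.product((0, 1), repeat=3)])

def A(theta):                                 # log-partition function, by enumeration
    return np.log(np.sum(np.exp(F @ theta)))

def mu(theta):                                # all edge-wise and univariate marginals
    p = np.exp(F @ theta - A(theta))
    return p @ F

theta = 0.5 * np.random.randn(F.shape[1])
finite_diff = np.array([(A(theta + 1e-5 * e) - A(theta - 1e-5 * e)) / 2e-5
                        for e in np.eye(len(theta))])
print(np.allclose(finite_diff, mu(theta), atol=1e-5))    # should print True
```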

The usual story

Suppose we want to do maximum likelihood learning.  This means we want to set {\boldsymbol \theta} to maximize

L( {\boldsymbol \theta} ) = \frac{1}{N}\sum_{\hat{{\bf x}}}\log p(\hat{{\bf x}};{\boldsymbol \theta})={\boldsymbol \theta}\cdot\frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}})-A({\boldsymbol \theta}).

If we want to use gradient ascent, we would just take a small step along the gradient. This has a very intuitive form: it is the difference between the average value of \bf f on the data and the expected value of \bf f under the current model.

\displaystyle \frac{dL}{d{\boldsymbol \theta}} = \frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}}) - \sum_{\bf x} p({\bf x};{\boldsymbol \theta}) {\bf f}({\bf x}).

\displaystyle \frac{dL}{d{\boldsymbol \theta}} = \frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}}) - {\boldsymbol \mu}({\boldsymbol \theta}).

Note the lovely property of moment matching here. If we have found a solution, then dL/d{\boldsymbol \theta}=0 and so the expected value of \bf f under the current distribution will be exactly equal to that under the data.

Unfortunately, in a high-treewidth setting, we can’t compute the marginals.  That’s too bad.  However, we have all these lovely approximate inference algorithms (loopy belief propagation, tree-reweighted belief propagation, mean field, etc.).  Suppose we write the resulting approximate marginals as \tilde{{\boldsymbol \mu}}({\boldsymbol \theta}).  Then, instead of taking the above gradient step, why not instead just use

\frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}}) - \tilde{{\boldsymbol \mu}}({\boldsymbol \theta})?

That’s all fine!  However, I often see people say/imply/write some or all of the following:

  1. This is not guaranteed to converge.
  2. There is no longer any well-defined objective function being maximized.
  3. We can’t use line searches.
  4. We have to use (possibly stochastic) gradient ascent.
  5. This whole procedure is frightening and shouldn’t be mentioned in polite company.

I agree that we should view this procedure with some suspicion, but it gets far more than it deserves! The first four points, in my view, are simply wrong.

What’s missing

The critical thing that is missing from the above story is this: Approximate marginals come together with an approximate partition function!

That is, if you are computing approximate marginals \tilde{{\boldsymbol \mu}}({\boldsymbol \theta}) using loopy belief propagation, mean-field, or tree-reweighted belief propagation, there is a well-defined approximate log-partition function \tilde{A}({\boldsymbol \theta}) such that

\displaystyle \tilde{{\boldsymbol \mu}}({\boldsymbol \theta}) = \frac{d\tilde{A}}{d{\boldsymbol \theta}}.

What this means is that you should think, not of approximating the likelihood gradient, but of approximating the likelihood itself. Specifically, what the above is really doing is optimizing the “surrogate likelihood”

\tilde{L}({\boldsymbol \theta}) = {\boldsymbol \theta}\cdot\frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}})-\tilde{A}({\boldsymbol \theta}).

What’s the gradient of this? It is

\frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}}) - \tilde{{\boldsymbol \mu}}({\boldsymbol \theta}),

or exactly the gradient that was being used above. The advantage of doing things this way is that it is a normal optimization.  There is a well-defined objective. It can be plugged into a standard optimization routine, such as BFGS, which will probably be faster than gradient ascent.  Line searches guarantee convergence. \tilde{A} is perfectly tractable to compute. In fact, if you have already computed approximate marginals, \tilde{A} has almost no cost. Life is good.
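To make that concrete, here's a sketch of what surrogate-likelihood training looks like, assuming you already have some approximate inference routine that returns the pair (\tilde{A}({\boldsymbol \theta}), \tilde{{\boldsymbol \mu}}({\boldsymbol \theta})) for a given parameter vector (the function names below are placeholders, not a real library API):

```python
# Surrogate-likelihood training: hand the surrogate objective and its exact
# gradient to a standard optimizer.
import numpy as np
from scipy.optimize import minimize

def fit_surrogate(data_feature_mean, approx_inference, theta0):
    # data_feature_mean: (1/N) sum over the data of f(x_hat)
    # approx_inference(theta): returns (A_tilde, mu_tilde), e.g. from loopy BP
    def neg_surrogate(theta):
        A_tilde, mu_tilde = approx_inference(theta)
        value = theta @ data_feature_mean - A_tilde     # surrogate likelihood
        grad = data_feature_mean - mu_tilde             # its exact gradient
        return -value, -grad                            # minimize the negative
    return minimize(neg_surrogate, theta0, jac=True, method="BFGS")
```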

The only counterargument I can think of is that mean-field and loopy BP can have different local optima, which might mean that a no-line-search-refuse-to-look-at-the-objective-function-just-follow-the-gradient-and-pray style optimization could be more robust, though I’d like to see that argument made…

I’m not sure of the history, but I think part of the reason this procedure has such a bad reputation (even from people that use it!) might be that it predates the “modern” understanding of inference procedures as producing approximate partition functions as well as approximate marginals.