Back to Basics: Approximate Bayesian Computation

(All based on these excellent slides from Umberto Picchini)

“Approximate Bayesian Computation” sounds like a broad class of methods that would potentially include things like message passing, variational methods, MCMC, etc. However, for historical reasons, the term is used for a very specialized class of methods.

The core idea is as follows:

Sample from the posterior using rejection sampling, with the accept/reject decision made by generating a synthetic dataset and comparing it to the observed one.

The basic idea

Take a model {p(z,x)}. Assume we observe some fixed {x} and want to sample from {p(z|x).} Assume {x} is discrete.

Algorithm (Basic ABC):

  • Sample {(z^{*},x^{*})\sim p(z,x)}
  • If {x^{*}=x}, return {z^{*}}
  • Else, repeat.
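
To make this concrete, here is a minimal Python sketch for a toy model (the model and the observed value are made up for illustration): {z} is a coin’s bias with a uniform prior, and {x} is the number of heads in 10 flips.

    import random

    def sample_prior():
        # z ~ Uniform(0, 1): the coin's bias
        return random.random()

    def simulate_data(z, n=10):
        # x | z ~ Binomial(n, z): number of heads in n flips
        return sum(random.random() < z for _ in range(n))

    def basic_abc(x_obs):
        # Repeat until the synthetic dataset exactly matches the observed one.
        while True:
            z_star = sample_prior()
            x_star = simulate_data(z_star)
            if x_star == x_obs:
                return z_star  # an exact sample from p(z | x_obs)

    samples = [basic_abc(x_obs=7) for _ in range(1000)]

Note that the sketch never evaluates a likelihood or a prior density; it only ever simulates.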

Claim: This algorithm returns an exact sample from the posterior {p(z|x).}

Proof: The probability of returning {z^{*}} is the probability of (i) drawing {z^{*}} from the prior, and (ii) sampling {x} conditional on {z^{*}}. Thus

\displaystyle  \begin{array}{rcl}  \text{probability of returning }z^{*} & = & p(z^{*})p(x\vert z^{*})\\ & = & p(z^{*},x)\\ & = & p(x)p(z^{*}\vert x). \end{array}

The probability of returning {z^{*}} in any one iteration is the posterior times the constant {p(x).} So this gives exact samples.

Why this is interesting.

There are two special properties:

  • It gives exact samples from the posterior. For so many Bayesian methods, it’s hard to know if you have a good answer or not. Here, if the algorithm successfully returns anything, you have an exact sample.
  • It works under minimal assumptions about the target model. All that’s needed is (i) the ability to sample from {p(z)} and (ii) the ability to sample from {p(x\vert z)}. You don’t even need to be able to evaluate either of these!

Why this works badly in general.

Of course, the problem is that you’ll typically be unlikely to exactly hit {x}. Formally speaking, the probability of returning anything in a given loop is

\displaystyle  \begin{array}{rcl}  \sum_{z^{*}}\text{probability of returning }z^{*} & = & p(x). \end{array}

In high dimensions, {p(x)} will typically be small unless you tend to get the same data regardless of the value of the latent variable. (In which case is the problem even interesting?)

It’s just rejection sampling

This is almost exactly rejection sampling. Remember that, in general, if you want to sample from {P(z)}, you need a proposal distribution {Q(z)} you can sample from, and you need to know a constant {M} such that {\frac{P(z)}{M}\leq Q(z).} The above algorithm is just using

\displaystyle  \begin{array}{rcl}  P(z) & = & p(z|x^{*})\\ Q(z) & = & p(z)\\ M & = & \frac{1}{p(x^{*})}. \end{array}

Then, {Q(z)} is a valid proposal since {P(z)/M\leq Q(z)} is equivalent to {p(z,x^{*})\leq p(z)} which is always true.

Why isn’t this exactly rejection sampling? In traditional descriptions of rejection sampling, you’d need to evaluate {P(z)} and {M}. In the ABC setting we can’t evaluate either of these, but we exploit the fact that we can simulate an event that occurs with exactly the required acceptance probability {P(z)/(MQ(z))}, as shown below.
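
To spell this out (following the notation just above, where {x^{*}} denotes the observed data), the acceptance probability of rejection sampling reduces to a conditional likelihood:

\displaystyle  \begin{array}{rcl}  \frac{P(z)}{M\,Q(z)} & = & \frac{p(z\vert x^{*})p(x^{*})}{p(z)}\\ & = & p(x^{*}\vert z). \end{array}

Simulating a dataset from {p(x\vert z)} and checking whether it equals {x^{*}} is precisely a coin flip with this probability, so no density ever needs to be evaluated.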

Adding an {\epsilon}

To increase the chance of accepting (or to make the algorithm work at all if {x} is continuous) you can add a “slop factor” of {\epsilon}. You change the algorithm to instead accept if \displaystyle  d(x,x^{*})\leq\epsilon for some distance {d} and some small {\epsilon}. The value of {\epsilon} introduces an accuracy / computation tradeoff. However, this doesn’t solve the fundamental problem: the approach still scales badly to high dimensions.
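
For instance, here is a hypothetical continuous variant of the earlier sketch, where {z} has a standard normal prior and {x\vert z} is a single Gaussian observation (the model, observed value, and tolerance are all made up for illustration):

    import random

    def abc_tolerance(x_obs, eps):
        # Accept when the synthetic dataset lands within eps of the observed one.
        while True:
            z_star = random.gauss(0.0, 1.0)     # z ~ N(0, 1) prior
            x_star = random.gauss(z_star, 1.0)  # x | z ~ N(z, 1)
            if abs(x_star - x_obs) <= eps:
                return z_star  # approximately distributed as p(z | x_obs)

    samples = [abc_tolerance(x_obs=1.5, eps=0.1) for _ in range(1000)]

Shrinking eps reduces the approximation error but also the acceptance rate, which is exactly the tradeoff mentioned above.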

Summary statistics

Another idea to reduce the expense is to instead compare summary statistics. That is, find some function {s(x)} and accept if \displaystyle  s(x^{*})=s(x) rather than if \displaystyle x^{*}=x as before.

If we make this change, then the probability of returning {z^{*}} in any one iteration is

\displaystyle  \begin{array}{rcl}  \text{probability of returning }z^{*} & = & p(z^{*})\sum_{x^{*}}p(x^{*}\vert z^{*})I[s(x^{*})=s(x)]\\ & = & p(z^{*})p(s(x)\vert z^{*})\\ & = & p(s(x))p(z^{*}\vert s(x)). \end{array}

Above we define {p(s'\vert z)=\sum_{x}p(x\vert z)I[s'=s(x)]} and {p(s')=\sum_{x}p(x)I[s(x)=s'].}

The probability of returning anything in a given round is \displaystyle  \begin{array}{rcl}  \sum_{z^{*}}\text{probability of returning }z^{*} & = & p(s(x)). \end{array}
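
As a sketch, suppose {x} is a sequence of 20 coin flips and we take {s(x)} to be the number of heads (all details here are illustrative). Matching the count succeeds far more often than matching the entire sequence:

    import random

    def flips(z, n=20):
        # x | z: a sequence of n Bernoulli(z) coin flips
        return [random.random() < z for _ in range(n)]

    def summary(x):
        # s(x): the number of heads
        return sum(x)

    def abc_summary(x_obs):
        s_obs = summary(x_obs)
        while True:
            z_star = random.random()          # z ~ Uniform(0, 1) prior
            x_star = flips(z_star, len(x_obs))
            if summary(x_star) == s_obs:      # accept on matching summaries
                return z_star

    x_obs = flips(z=0.7)  # pretend this is the observed data
    samples = [abc_summary(x_obs) for _ in range(1000)]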

There’s good news and bad news about making this change.

Good news:

  • We have improved the acceptance rate from {p(x)} to {p(s(x)).} This could be much higher if there are many different datasets that yield the same summary statistics.

Bad news:

  • This introduces errors in general. To avoid introducing error, we need that {p(z\vert s(x))=p(z\vert x).}

Exponential family

Often, summary statistics are used even though they introduce errors. It seems to be a bit of a craft to find good summary statistics to speed things up without introducing too much error.

There is one appealing case where no error is introduced. Suppose {p(x|z)} is in the exponential family and {s(x)} are the sufficient statistics for that family. Then, we know that {p(z|x)=p(z\vert s(x))}. This is very nice.
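
For instance, in the coin-flip sketch above, the likelihood is {p(x\vert z)=z^{s(x)}(1-z)^{n-s(x)}}, which depends on {x} only through the number of heads {s(x)}. So {s} is sufficient, {p(z\vert x)=p(z\vert s(x))}, and accepting on matching counts still yields exact posterior samples.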

Slop factors

If you’re using a slop factor, you can instead accept according to \displaystyle  d(s(x),s(x^{*}))\leq\epsilon.
This introduces the same kind of computation / accuracy tradeoff.

ABC-MCMC (Or likelihood-free MCMC)

Before getting to ABC-MCMC, suppose that for some reason we just wanted to use Metropolis-Hastings to sample from the prior {p(z)}. If our proposal distribution is {q(\tilde{z}\vert z)}, then we’d do:

Algorithm: (Regular old Metropolis-Hastings)

  • Initialize {z} somehow
  • Repeat:
    • Propose {\tilde{z}\sim q(\tilde{z}\vert z)} from some proposal distribution
    • Compute acceptance probability \displaystyle {\alpha\leftarrow\frac{p(\tilde{z})}{p(z)}\frac{q(z\vert\tilde{z})}{q(\tilde{z}\vert z)}}.
    • Generate {r\sim U(0,1).}
    • If {r\leq\alpha} then {z\leftarrow\tilde{z}}.

Now suppose we instead want to sample from the posterior {p(z|x)}. We suggest the following algorithm, which adds a synthetic-data step and an extra factor in the acceptance probability.

Algorithm: (ABC-MCMC)

  • Initialize {z} somehow and initialize {x^{*}=x.}
  • Repeat:
    • Propose {\tilde{z}\sim q(\tilde{z}\vert z)}.
    • Sample a synthetic dataset {\tilde{x}^{*}\sim p(\tilde{x}^{*}\vert\tilde{z})}.
    • Compute acceptance probability {{\displaystyle \alpha\leftarrow\frac{p(\tilde{z})}{p(z)}\frac{q(z\vert\tilde{z})}{q(\tilde{z}\vert z)}}} {{\displaystyle I[\tilde{x}^{*}=x^{*}]}}.
    • Generate {r\sim U(0,1).}
    • If {r\leq\alpha} then {z\leftarrow\tilde{z}}.

There is only one substantive difference: after proposing {\tilde{z}}, we generate a synthetic dataset {\tilde{x}^{*}}, and we can accept only if that synthetic dataset is the same as the observed one. A concrete sketch follows.
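
Here is a minimal sketch for the binomial toy model from before. The uniform prior and the symmetric random-walk proposal are illustrative choices; conveniently, they make the prior and proposal ratios cancel, leaving only the indicator:

    import random

    def simulate_data(z, n=10):
        # x | z ~ Binomial(n, z): number of heads in n flips
        return sum(random.random() < z for _ in range(n))

    def abc_mcmc(x_obs, n_iters=100_000, step=0.1):
        z = 0.5  # initialize z somehow; x* is implicitly initialized to x_obs
        samples = []
        for _ in range(n_iters):
            z_prop = z + random.gauss(0.0, step)  # symmetric random-walk proposal
            # Uniform(0, 1) prior: proposals outside [0, 1] have prior ratio 0.
            # Otherwise the prior and q ratios cancel, so the acceptance
            # probability reduces to the indicator I[x_prop == x_obs].
            if 0.0 <= z_prop <= 1.0 and simulate_data(z_prop) == x_obs:
                z = z_prop
            samples.append(z)
        return samples

    samples = abc_mcmc(x_obs=7)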

What this solves

There are two computational problems that the original ABC algorithm can encounter:

  • The prior {p(z)} may be a terrible proposal distribution for the posterior {p(z|x)}. Maybe random samples from the prior almost never yield datasets similar to the observed one.
  • Even when the proposed {z} is good, the probability {p(x\vert z)} of generating a matching dataset might still be very low.

The ABC-MCMC algorithm seems intended to deal with the first problem: if the proposal distribution {q(\tilde{z}\vert z)} only yields nearby points, then once the typical set has been reached, the probability of proposing a “good” {\tilde{z}} is much higher.

On the other hand, the ABC-MCMC algorithm seems to do little to address the second problem.

Justification

Now, why is this a correct algorithm? Consider the augmented distribution \displaystyle  p(z,x^{*},x)=p(z,x^{*})I[x^{*}=x].

We now want to sample from {p(z,x^{*}\vert x)} using Metropolis-Hastings. We choose the proposal distribution

\displaystyle  q(\tilde{z},\tilde{x}^{*}\vert z,x^{*})=q(\tilde{z}\vert z)p(\tilde{x}^{*}\vert\tilde{z}).

The acceptance probability will then be

\displaystyle  \begin{array}{rcl}  \alpha & = & \min\left(1,\frac{p(\tilde{z},\tilde{x}^{*},x)}{p(z,x^{*},x)}\frac{q(z,x^{*}\vert\tilde{z},\tilde{x}^{*})}{q(\tilde{z},\tilde{x}^{*}\vert z,x^{*})}\right)\\ & = & \min\left(1,\frac{p(\tilde{z},\tilde{x}^{*})I[\tilde{x}^{*}=x]}{p(z,x^{*})I[x^{*}=x]}\frac{q(z\vert\tilde{z})p(x^{*}\vert z)}{q(\tilde{z}\vert z)p(\tilde{x}^{*}\vert\tilde{z})}\right) \end{array} .

Since the original state {(z,x^{*})} was accepted, it must be true that {x^{*}=x.} Thus, the above can be simplified into

\displaystyle  \begin{array}{rcl}  \alpha & = & \min\left(1,\frac{p(\tilde{z})}{p(z)}\frac{q(z\vert\tilde{z})}{q(\tilde{z}\vert z)}\right)I[\tilde{x}^{*}=x]. \end{array}

Generalizations

If using summary statistics, you change {I[\tilde{x}^{*}=x]} into {I[s(\tilde{x}^{*})=s(x)].} You can also add a slop factor if you want.

More generally still, we could instead use the augmented distribution

\displaystyle  p(z,x^{*},x)=p(z,x^{*})p(x\vert x^{*}).

The kernel {p(x\vert x^{*})} can be something interesting like a multivariate Gaussian centered at {x^{*}}. The acceptance probability then becomes

\displaystyle  \begin{array}{rcl}  \alpha & = & \min\left(1,\frac{p(\tilde{z})p(x\vert\tilde{x}^{*})}{p(z)p(x\vert x^{*})}\frac{q(z\vert\tilde{z})}{q(\tilde{z}\vert z)}\right). \end{array}

Of course, this introduces some error.
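
For example, if we take {p(x\vert x^{*})} to be a spherical Gaussian {\mathcal{N}(x;x^{*},\epsilon^{2}I)} centered on the synthetic dataset (an illustrative choice), the kernel ratio appearing in {\alpha} is

\displaystyle  \begin{array}{rcl}  \frac{p(x\vert\tilde{x}^{*})}{p(x\vert x^{*})} & = & \exp\left(\frac{\Vert x-x^{*}\Vert^{2}-\Vert x-\tilde{x}^{*}\Vert^{2}}{2\epsilon^{2}}\right), \end{array}

so proposals whose synthetic datasets land closer to the observed {x} are favored smoothly, rather than through a hard threshold {d(x,x^{*})\leq\epsilon}.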

Choosing {\epsilon}

In practice, a value of {\epsilon} small enough to be accurate at the end will lead to very slow progress at the beginning, so it’s best to slowly reduce {\epsilon} over time. Shooting for an acceptance rate of around 1% at the end seems to be a good compromise; a higher acceptance rate would mean that too much error was being introduced.
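
One simple approach is a geometric decay toward a target tolerance; the constants in this sketch are arbitrary and would need tuning until the final acceptance rate is around 1%:

    def epsilon_schedule(t, eps0=1.0, eps_final=0.01, decay=0.999):
        # Geometric decay from eps0 toward a floor of eps_final.
        # All constants are hypothetical and should be tuned per problem.
        return max(eps0 * decay ** t, eps_final)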

(Thanks to Javier Burroni for helpful comments.)
