(All based on these excellent slides from Umberto Picchini)
“Approximate Bayesian Computation” sounds like a broad class of methods that would potentially include things like message passing, variational methods, MCMC, etc. However, for historical reasons, the term is used for a very specialized class of methods.
The core idea is as follows:
Sample from the posterior using rejection sampling, with the accept/reject decision made by generating a synthetic dataset and comparing it to the observed one.
The basic idea
Take a model $p(z)\,p(x \vert z)$, where $z$ is a latent variable (or parameter) and $x$ is the data. Assume we observe some fixed $x$ and want to sample from the posterior $p(z \vert x)$. Assume, for now, that $x$ is discrete.
Algorithm (Basic ABC):
- Sample $\tilde z \sim p(z)$ and then $\tilde x \sim p(x \vert \tilde z)$.
- If $\tilde x = x$, return $\tilde z$.
- Else, repeat.
Claim: This algorithm returns an exact sample from the posterior $p(z \vert x)$.

Proof: The probability of returning a particular $z$ is the probability of (i) drawing $z$ from the prior, and (ii) sampling $x$ conditional on $z$. Thus,

$$p(z)\, p(x \vert z) = p(z \vert x)\, p(x).$$

The probability of returning $z$ in any one iteration is therefore the posterior times the constant $p(x)$. So this gives exact samples.
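To make this concrete, here is a minimal sketch of Basic ABC in Python. The model (a Beta prior on a coin's bias with a Binomial observation) is my own toy choice, not from the slides; it's small enough that the exact posterior is known, so we can sanity-check the samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model: z ~ Beta(2, 2) is a coin's bias, and x | z ~ Binomial(10, z)
# is the number of heads in 10 flips (a discrete dataset, as the algorithm requires).
x_obs = 7  # the fixed observed dataset

def basic_abc():
    """Return one exact posterior sample via rejection."""
    while True:
        z_tilde = rng.beta(2.0, 2.0)         # sample z~ from the prior
        x_tilde = rng.binomial(10, z_tilde)  # sample a synthetic dataset
        if x_tilde == x_obs:                 # accept only on an exact match
            return z_tilde

samples = np.array([basic_abc() for _ in range(2000)])
# The exact posterior here is Beta(2 + 7, 2 + 3) = Beta(9, 5), with mean 9/14 ≈ 0.643.
print("ABC posterior mean:", samples.mean())
```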
Why this is interesting.
There are two special properties:
- It gives exact samples from the posterior. For so many Bayesian methods, it’s hard to know if you have a good answer or not. Here, if the algorithm successfully returns anything, you have an exact sample.
- It works under minimal assumptions about the target model. All that's needed is (i) to be able to simulate $z \sim p(z)$, and (ii) to be able to simulate $x \sim p(x \vert z)$. You don't even need to be able to evaluate either of these!
Why this works badly in general.
Of course, the problem is that you'll typically be unlikely to exactly hit $x$. Formally speaking, the probability of returning anything in a given loop is

$$p(x) = \sum_z p(z)\, p(x \vert z).$$

In high dimensions, $p(x)$ will typically be small unless you tend to get the same data regardless of the value of the latent variable. (In which case, is the problem even interesting?)
It’s just rejection sampling
This is almost exactly rejection sampling. Remember that, in general, if you want to sample from $p(z \vert x)$, you need a proposal distribution $q(z)$ that you can sample from, and you need to know a constant $M$ such that $p(z \vert x) \le M\, q(z)$.

The above algorithm is just using

$$q(z) = p(z) \quad \text{and} \quad M = \frac{1}{p(x)}.$$

Then, $q$ is a valid proposal, since $p(z \vert x) \le \frac{p(z)}{p(x)}$ is equivalent to $p(x \vert z) \le 1$, which is always true.
Why isn't this exactly rejection sampling? In traditional descriptions of rejection sampling, you'd need to evaluate $p(z \vert x)$ and $M\, q(z)$, and accept with probability equal to their ratio. In the ABC setting we can't calculate either of these, but we exploit the fact that we can simulate an accept event whose probability is exactly that ratio,

$$\frac{p(z \vert x)}{M\, q(z)} = \frac{p(z \vert x)\, p(x)}{p(z)} = p(x \vert z),$$

simply by drawing $\tilde x \sim p(x \vert \tilde z)$ and accepting if $\tilde x = x$.
Adding an $\epsilon$
To increase the chance of accepting (or to make the algorithm work at all if $x$ is continuous) you need to add a "slop factor" of $\epsilon$. You change the algorithm to instead accept if $\Vert \tilde x - x \Vert \le \epsilon$ for some small $\epsilon$. The value of $\epsilon$ introduces an accuracy / computation tradeoff. However, this doesn't solve the fundamental problem: things still don't scale that well to high dimensions.
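Here is a sketch of the same scheme with a slop factor, on an assumed toy model where the data are continuous (a Gaussian mean with $n$ i.i.d. observations). Acceptance requires the entire synthetic dataset to land within $\epsilon$ of the observed one, which is exactly why this scales poorly as the dataset grows.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy model: z ~ N(0, 1), and x | z is n i.i.d. draws from N(z, 1).
n = 5
x_obs = rng.normal(1.0, 1.0, size=n)  # the fixed observed dataset

def abc_with_slop(epsilon, max_tries=100_000):
    """Accept z~ once the synthetic dataset lands within epsilon of the observed one."""
    for _ in range(max_tries):
        z_tilde = rng.normal(0.0, 1.0)                  # draw z~ from the prior
        x_tilde = rng.normal(z_tilde, 1.0, size=n)      # draw a synthetic dataset
        if np.linalg.norm(x_tilde - x_obs) <= epsilon:  # slop-factor acceptance test
            return z_tilde
    return None  # gave up: epsilon too small (or the data too high-dimensional)

draws = [abc_with_slop(epsilon=1.5) for _ in range(100)]
samples = np.array([s for s in draws if s is not None])
# The exact posterior is N(sum(x_obs) / (n + 1), 1 / (n + 1)); a smaller epsilon
# gets closer to it, but accepts less often.
print("ABC mean:", samples.mean(), " exact posterior mean:", x_obs.sum() / (n + 1))
```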
Summary statistics
Another idea to reduce expense is to instead compare summary statistics. That is, find some function $s(x)$ and accept if $s(\tilde x) = s(x)$, rather than if $\tilde x = x$ as before.

If we make this change, then the probability of returning $z$ in any one iteration is

$$p(z)\, p(s(x) \vert z).$$

Above, we define

$$p(s \vert z) = \sum_{x' :\, s(x') = s} p(x' \vert z) \quad \text{and} \quad p(s) = \sum_z p(z)\, p(s \vert z).$$

The probability of returning anything in a given round is $p(s(x))$.
There’s good news and bad news about making this change.
Good news:
- We have improved the acceptance rate from $p(x)$ to $p(s(x)) \ge p(x)$. This could be much higher if there are many different datasets that yield the same summary statistics.
Bad news:
- This introduces errors in general. To avoid introducing error, we need that $p(z \vert s(x)) = p(z \vert x)$, i.e. that $s$ is a sufficient statistic.
Exponential family
Often, summary statistics are used even though they introduce errors. It seems to be a bit of a craft to find good summary statistics to speed things up without introducing too much error.
There is one appealing case where no error is introduced. Suppose $p(x \vert z)$ is in the exponential family and $s(x)$ are the sufficient statistics for that family. Then, we know that $p(z \vert s(x)) = p(z \vert x)$. This is very nice.
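As a sketch of this happy case: for a dataset of coin flips (again an assumed toy model, with a Beta prior on the bias), the number of heads is a sufficient statistic, so matching it exactly introduces no error, while accepting vastly more often than matching the whole sequence of flips would.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed toy model: z ~ Beta(2, 2) is a coin's bias, x | z is a sequence of 50 flips.
n_flips = 50
x_obs = rng.binomial(1, 0.7, size=n_flips)  # the fixed observed dataset
s_obs = x_obs.sum()                         # sufficient statistic: number of heads

def abc_summary():
    """Accept when the summary statistic matches exactly; no error, since s is sufficient."""
    tries = 0
    while True:
        tries += 1
        z_tilde = rng.beta(2.0, 2.0)
        x_tilde = rng.binomial(1, z_tilde, size=n_flips)
        if x_tilde.sum() == s_obs:  # matching all 50 flips exactly would essentially never happen
            return z_tilde, tries

samples, tries = zip(*(abc_summary() for _ in range(1000)))
print("acceptance rate:", 1.0 / np.mean(tries))
# The exact posterior is Beta(2 + s_obs, 2 + n_flips - s_obs); compare means:
print("ABC mean:", np.mean(samples), " exact mean:", (2 + s_obs) / (4 + n_flips))
```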
Slop factors
If you're using a slop factor, you can instead accept according to $\Vert s(\tilde x) - s(x) \Vert \le \epsilon$.
This introduces the same kind of computation / accuracy tradeoff.
ABC-MCMC (Or likelihood-free MCMC)
Before getting to ABC-MCMC, suppose we just wanted, for some reason, to use Metropolis-Hastings to sample from the prior $p(z)$. If our proposal distribution was $q(\tilde z \vert z)$, then we'd do the following.
Algorithm: (Regular old Metropolis-Hastings)
- Initialize $z$ somehow.
- Repeat:
  - Propose $\tilde z \sim q(\tilde z \vert z)$ from some proposal distribution.
  - Compute the acceptance probability $\alpha = \min\left(1,\ \frac{p(\tilde z)\, q(z \vert \tilde z)}{p(z)\, q(\tilde z \vert z)}\right)$.
  - Generate $u \sim \mathrm{Uniform}(0, 1)$.
  - If $u \le \alpha$, then $z \leftarrow \tilde z$; otherwise leave $z$ unchanged.
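As a sketch, here is that loop in Python, targeting an assumed standard normal prior with a Gaussian random-walk proposal (both choices are mine, just for illustration; the symmetric proposal makes the $q$ terms cancel).

```python
import numpy as np

rng = np.random.default_rng(3)

def log_prior(z):
    return -0.5 * z ** 2  # assumed standard normal prior, up to a constant

def mh_prior(n_steps=5000, step=1.0):
    """Plain Metropolis-Hastings targeting the prior p(z)."""
    z = 0.0  # initialize z somehow
    chain = []
    for _ in range(n_steps):
        z_tilde = rng.normal(z, step)  # Gaussian random-walk proposal; symmetric, so q cancels
        alpha = min(1.0, np.exp(log_prior(z_tilde) - log_prior(z)))
        if rng.uniform() <= alpha:
            z = z_tilde
        chain.append(z)
    return np.array(chain)

chain = mh_prior()
print("mean (should be near 0):", chain.mean(), " variance (should be near 1):", chain.var())
```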
Now suppose we instead want to sample from the posterior $p(z \vert x)$. We will use the following algorithm instead, with the changes shown in bold.
Algorithm: (ABC-MCMC)
- Initialize $z$ somehow, **and initialize $\tilde x \leftarrow x$**.
- Repeat:
  - Propose $\tilde z \sim q(\tilde z \vert z)$.
  - **Sample a synthetic dataset $\tilde x' \sim p(x \vert \tilde z)$.**
  - Compute the acceptance probability $\alpha = \min\left(1,\ \frac{p(\tilde z)\, q(z \vert \tilde z)}{p(z)\, q(\tilde z \vert z)}\right)\, \mathbb{I}[\tilde x' = x]$.
  - Generate $u \sim \mathrm{Uniform}(0, 1)$.
  - If $u \le \alpha$, then $z \leftarrow \tilde z$ **and $\tilde x \leftarrow \tilde x'$**; otherwise leave $(z, \tilde x)$ unchanged.
There is only one real difference: after proposing $\tilde z$, we generate a synthetic dataset, and we can accept only if the synthetic dataset is the same as the observed one.
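Here is a sketch of ABC-MCMC on the same assumed coin-flip toy model as above (matching the number-of-heads statistic exactly). Compared with the plain Metropolis-Hastings loop, the only changes are the synthetic dataset and the indicator factor in the acceptance probability.

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed toy model: z ~ Beta(2, 2) is a coin's bias, x | z is 50 flips of the coin.
# We match the sufficient statistic s(x) = number of heads exactly.
n_flips = 50
x_obs = rng.binomial(1, 0.7, size=n_flips)  # the fixed observed dataset
s_obs = x_obs.sum()

def log_prior(z):
    # Beta(2, 2) density on (0, 1), up to a constant.
    return np.log(z) + np.log(1.0 - z) if 0.0 < z < 1.0 else -np.inf

def abc_mcmc(n_steps=20_000, step=0.1):
    z = s_obs / n_flips  # initialize z somehow (the synthetic data is implicitly initialized to x_obs)
    chain = []
    for _ in range(n_steps):
        z_tilde = rng.normal(z, step)  # Gaussian random-walk proposal (symmetric, so q cancels)
        # Sample a synthetic dataset and check whether its summary statistic matches.
        match = (0.0 < z_tilde < 1.0) and rng.binomial(1, z_tilde, size=n_flips).sum() == s_obs
        # Acceptance probability: the usual MH ratio for the prior, times the indicator.
        alpha = min(1.0, np.exp(log_prior(z_tilde) - log_prior(z))) if match else 0.0
        if rng.uniform() <= alpha:
            z = z_tilde
        chain.append(z)
    return np.array(chain)

chain = abc_mcmc()
print("ABC-MCMC mean:", chain.mean())
print("exact posterior mean:", (2 + s_obs) / (4 + n_flips))
```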
What this solves
There are two computational problems that the original ABC algorithm can encounter:
- The prior $p(z)$ may be a terrible proposal distribution for the posterior $p(z \vert x)$. Maybe random samples from the prior almost never yield datasets similar to the observed one.
- Even with a good proposal for $\tilde z$, the acceptance probability $p(x \vert \tilde z)$ might be very low.
The ABC-MCMC algorithm seems intended to deal with the first problem: if the proposal distribution only yields nearby points, then once the typical set has been reached, the probability of proposing a "good" $\tilde z$ is much higher.

On the other hand, the ABC-MCMC algorithm seems to do little to address the second problem.
Justification
Now, why is this a correct algorithm? Consider the augmented distribution

$$p(z, \tilde x \vert x) \propto p(z)\, p(\tilde x \vert z)\, \mathbb{I}[\tilde x = x].$$

Note that the $z$-marginal of this augmented distribution is exactly the posterior $p(z \vert x)$, so samples of $z$ from it are what we want. We now want to sample from it using Metropolis-Hastings. We choose the proposal distribution

$$q(\tilde z, \tilde x' \vert z, \tilde x) = q(\tilde z \vert z)\, p(\tilde x' \vert \tilde z).$$

The acceptance probability will then be

$$\alpha = \min\left(1,\ \frac{p(\tilde z)\, p(\tilde x' \vert \tilde z)\, \mathbb{I}[\tilde x' = x]}{p(z)\, p(\tilde x \vert z)\, \mathbb{I}[\tilde x = x]} \cdot \frac{q(z \vert \tilde z)\, p(\tilde x \vert z)}{q(\tilde z \vert z)\, p(\tilde x' \vert \tilde z)}\right) = \min\left(1,\ \frac{p(\tilde z)\, q(z \vert \tilde z)\, \mathbb{I}[\tilde x' = x]}{p(z)\, q(\tilde z \vert z)\, \mathbb{I}[\tilde x = x]}\right).$$

Since the original state was accepted (or the chain was initialized that way), it must be true that $\tilde x = x$, so $\mathbb{I}[\tilde x = x] = 1$. Thus, the above can be simplified into

$$\alpha = \min\left(1,\ \frac{p(\tilde z)\, q(z \vert \tilde z)}{p(z)\, q(\tilde z \vert z)}\right)\, \mathbb{I}[\tilde x' = x],$$

which is exactly the acceptance probability used in ABC-MCMC above.
Generalizations
If using summary statistics, you change $\mathbb{I}[\tilde x = x]$ into $\mathbb{I}[s(\tilde x) = s(x)]$. You can also add a slop factor if you want, i.e. use $\mathbb{I}[\Vert s(\tilde x) - s(x) \Vert \le \epsilon]$.

More generally still, we could instead use the augmented distribution

$$p_\epsilon(z, \tilde x \vert x) \propto p(z)\, p(\tilde x \vert z)\, K_\epsilon(\tilde x, x).$$

The kernel $K_\epsilon$ can be something interesting like a multivariate Gaussian centered at $x$. The acceptance probability then instead becomes

$$\alpha = \min\left(1,\ \frac{p(\tilde z)\, q(z \vert \tilde z)\, K_\epsilon(\tilde x', x)}{p(z)\, q(\tilde z \vert z)\, K_\epsilon(\tilde x, x)}\right).$$

Of course, this introduces some error.
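As a sketch of this kernel variant, the acceptance probability could be computed as below, with an isotropic Gaussian kernel standing in for the indicator. The helper names and inputs are assumptions for illustration; the function would slot into an ABC-MCMC loop like the one above.

```python
import numpy as np

def log_gaussian_kernel(x_synth, x_obs, epsilon):
    """log K_eps(x_synth, x_obs): an isotropic Gaussian kernel with bandwidth epsilon."""
    diff = np.asarray(x_synth, dtype=float) - np.asarray(x_obs, dtype=float)
    return -0.5 * float(np.dot(diff, diff)) / epsilon ** 2

def kernel_acceptance(log_prior_new, log_prior_old,
                      log_q_old_given_new, log_q_new_given_old,
                      x_synth_new, x_synth_old, x_obs, epsilon):
    """Acceptance probability with the indicator replaced by a Gaussian kernel K_eps."""
    log_ratio = (log_prior_new - log_prior_old
                 + log_q_old_given_new - log_q_new_given_old
                 + log_gaussian_kernel(x_synth_new, x_obs, epsilon)
                 - log_gaussian_kernel(x_synth_old, x_obs, epsilon))
    return min(1.0, float(np.exp(log_ratio)))

# Tiny usage example with made-up numbers:
print(kernel_acceptance(-0.5, -0.7, 0.0, 0.0,
                        x_synth_new=[1.1, 2.0], x_synth_old=[0.4, 2.5],
                        x_obs=[1.0, 2.1], epsilon=0.5))
```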
Choosing $\epsilon$
In practice, a value of $\epsilon$ that's good at the end will lead to very slow progress at the beginning. It's best to slowly reduce $\epsilon$ over time. It seems like shooting for an acceptance rate of around 1% at the end is a good compromise: a higher acceptance rate would mean that too much error was introduced.
(Thanks to Javier Burroni for helpful comments.)