Calculating variance in the log domain

Say you’ve got a positive dataset and you want to calculate the variance. However, the numbers in your dataset are huge, so huge you need to represent them in the log-domain. How do you compute the log-variance without things blowing up?

I ran into the problem today. To my surprise, I couldn’t find a standard solution.

The bad solution

Suppose that your data is x_1 \cdots x_n, which you have stored as l_1 \cdots l_n where x_i = \exp(l_i). The obvious thing to do is to just exponentiate and then compute the variance. That would be something like the following:

\text{For all } i\in\{1,\cdots,n\}, \text{ set } x_i = \exp(l_i)

\text{var} = \frac{1}{n} \sum_{i=1}^n (x_i - \frac{1}{n} \sum_{j=1}^n x_j)^2

This of course is a terrible idea: When l_i is large, you can’t even write down x_i without running into numerical problems.
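In code, the bad solution is a one-liner. A sketch (the function name is just for illustration):

import numpy as np

def naive_log_domain_var(logx):
    "np.log of the variance, computed the obvious way."
    return np.log(np.var(np.exp(logx)))   # np.exp overflows once logx exceeds roughly 88 in single precision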

The mediocre solution

The first idea I had for this problem was relatively elegant. We can of course represent the variance as

\text{var} = \bar{x^2} - \bar{x}^2.

Instead of calculating \bar{x} and \bar{x^2}, why not calculate the log of these quantities?

To do this, we can introduce a “log domain mean” operator, a close relative of the good-old scipy.special.logsumexp

import numpy as np

def log_domain_mean(logx):
    "np.log(np.mean(np.exp(logx))) but more stable"
    n = len(logx)
    damax = np.max(logx)
    return np.log(np.sum(np.exp(logx - damax))) \
        + damax - np.log(n)

Next, introduce a “log-sub-add” operator. (A variant of np.logaddexp)

def logsubadd(a,b):
    "np.log(np.exp(a)-np.exp(b)) but more stable"
    return a + np.log(1-np.exp(b-a))

Then, we can compute the log-variance as

def log_domain_var(logx):
    a = log_domain_mean(2*logx)
    b = log_domain_mean(logx)*2
    c = logsubadd(a,b)
    return c

Here a is \log \bar{x^2} while b is \log \bar{x}^2.

Nice, right? Well, it’s much better than the first solution. But it isn’t that good. The problem is that when the variance is small, a and b are close. When they are both close and large, logsubadd runs into numerical problems. It’s not clear that there is a way to fix this problem with logsubadd.

To solve this, abandon elegance!

The good solution

For the good solution, the math is a series of not-too-intuitive transformations. (I put them at the end.) These start with

\text{var} = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2

and end with

\log \text{var} = \log \sum_{i=1}^n (\exp (l_i - \log \bar{x}) - 1)^2 + 2 \log \bar{x} - \log(n).

Why this form? Well, we’ve reduced to things we can do relatively stably: Compute the log-mean, and do a (small variant of) log-sum-exp.

def log_domain_var(logx):
    """like np.log(np.var(np.exp(logx)))
    except more stable"""
    n = len(logx)
    log_xmean = log_domain_mean(logx)
    return np.log(np.sum( np.expm1(logx-log_xmean)**2))\
    + 2*log_xmean - np.log(n)

This uses the log_domain_mean implementation from above, and also np.expm1 to compute \exp(a)-1 in a more stable way when a is close to zero.

Why is this stable? Is it really stable? Well, umm, I’m not sure. I derived transformations that “looked stable” to me, but there’s no proof that this is best. I’d be surprised if a better solution wasn’t possible. (I’d also be surprised if there isn’t a paper from 25+ years ago that describes that better solution.)

In any case, I’ve experimentally found that this function will (while working in single precision) happily compute the variance even when logx is in the range of -10^{30} to 10^{30}, which is about 28 orders of magnitude better than the naive solution and sufficient for my needs.
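As a rough sketch of that kind of check (the scale, the synthetic data, and the shift-based reference below are all made up purely for testing, and reuse log_domain_var from above):

import numpy as np

def rough_check(scale, n=1000, seed=0):
    "Compare log_domain_var against a shift-based reference on synthetic single-precision data."
    rng = np.random.default_rng(seed)
    logx = (scale + rng.standard_normal(n)).astype(np.float32)
    # Reference: var(exp(l)) = exp(2*shift) * var(exp(l - shift)), with shift = max(l),
    # so the exponentials stay in a safe range.
    shift = float(np.max(logx))
    reference = np.log(np.var(np.exp(logx - shift))) + 2.0 * shift
    return log_domain_var(logx), reference

print(rough_check(1e5))   # the naive np.var(np.exp(logx)) would have overflowed long before this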

As always, failure cases are probably out there. Numerical instability always wins when it can be bothered to make an effort.

Appendix: The transformations

\text{var} = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2

\text{var} = \frac{1}{n} \sum_{i=1}^n (\exp (l_i) - \bar{x})^2

\text{var} = \frac{1}{n} \sum_{i=1}^n (\exp (l_i) / \bar{x} - 1)^2 \bar{x}^2

\text{var} = \frac{1}{n} \sum_{i=1}^n (\exp (l_i - \log \bar{x}) - 1)^2 \bar{x}^2

\log \text{var} = \log \sum_{i=1}^n (\exp (l_i - \log \bar{x}) - 1)^2 + 2 \log \bar{x} - \log(n)

Back to Basics: Approximate Bayesian Computation

(All based on these excellent slides from Umberto Picchini)

“Approximate Bayesian Computation” sounds like a broad class of methods that would potentially include things like message passing, variational methods, MCMC, etc. However, for historical reasons, the term is used for a very specialized class of methods.

The core idea is as follows:

Sample from the posterior using rejection sampling, with the accept/reject decision made by generating a synthetic dataset and comparing it to the observed one.

The basic idea

Take a model {p(z,x)}. Assume we observe some fixed {x} and want to sample from {p(z|x).} Assume {x} is discrete.

Algorithm (Basic ABC):

  • Sample {(z^{*},x^{*})\sim p(z,x)}
  • If {x^{*}=x}, return {z^{*}}
  • Else, repeat.

Claim: This algorithm returns an exact sample from the posterior {p(z|x).}

Proof: The probability of returning {z^{*}} is the probability of (i) drawing {z^{*}} from the prior, and (ii) sampling {x} conditional on {z}. Thus

\displaystyle  \begin{array}{rcl}  \text{probability of returning }z^{*} & = & p(z^{*})p(x\vert z^{*})\\ & = & p(z^{*},x)\\ & = & p(x)p(z^{*}\vert x). \end{array}

The probability of returning {z^{*}} in any one iteration is the posterior times the constant {p(x).} So this gives exact samples.
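Here’s a minimal sketch of this loop for a made-up discrete model (a coin with a uniform prior on its bias z, observed through n binomial flips; the model and numbers are purely illustrative):

import numpy as np

def basic_abc(x_obs, n, rng=np.random.default_rng(0)):
    "Repeat until the synthetic dataset exactly matches the observed one."
    while True:
        z_star = rng.uniform(0, 1)            # sample z* ~ p(z)
        x_star = rng.binomial(n, z_star)      # sample x* ~ p(x|z*)
        if x_star == x_obs:                   # accept only on an exact match
            return z_star

samples = [basic_abc(x_obs=7, n=20) for _ in range(1000)]
print(np.mean(samples))   # near the exact posterior mean (7+1)/(20+2) for a uniform prior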

Why this is interesting.

There’s two special properties:

  • It gives exact samples from the posterior. For so many Bayesian methods, it’s hard to know if you have a good answer or not. Here, if the algorithm successfully returns anything, you have an exact sample.
  • It works under minimal assumptions about the target model. All that’s needed is (i) To be able to simulate p(z) and (ii) to be able to simulate p(x|z). You don’t even need to be able to evaluate either of these!

Why this works badly in general.

Of course, the problem is that you’ll typically be unlikely to exactly hit {x}. Formally speaking, the probability of returning anything in a given loop is

\displaystyle  \begin{array}{rcl}  \sum_{z^{*}}\text{probability of returning }z^{*} & = & p(x). \end{array}

In high dimensions, {p(x)} will typically be small unless you tend to get the same data regardless of the value of the latent variable. (In which case is the problem even interesting?)

It’s just rejection sampling

This is almost exactly rejection sampling. Remember that in general if you want to sample from {P(z)} you need a proposal distribution {Q(z)} you can sample from and you need to know a constant {M} such that {\frac{P(z)}{M}\leq Q(z).} The above algorithm is just using

\displaystyle  \begin{array}{rcl}  P(z) & = & p(z|x^{*})\\ Q(z) & = & p(z)\\ M & = & \frac{1}{p(x^{*})}. \end{array}

Then, {Q(z)} is a valid proposal since {P(z)/M\leq Q(z)} is equivalent to {p(z,x^{*})\leq p(z)} which is always true.

Why isn’t this exactly rejection sampling? In traditional descriptions of rejection sampling, you’d need to calculate {P(z)} and {M}. In the ABC setting we can’t calculate either of these, but we exploit that we can calculate the ratio {P(z)/M.}

Adding an {\epsilon}

To increase the chance of accepting (or to make the algorithm work at all if {x} is continuous) you need to add a “slop factor” of {\epsilon}. You change the algorithm to instead accept if \displaystyle  d(x,x^{*})\leq\epsilon for some small {\epsilon}. The value of {\epsilon} introduces an accuracy / computation tradeoff. However, this doesn’t solve the fundamental problem that things don’t scale that well to high dimensions.

Summary statistics

Another idea to reduce expense is to instead compare summary statistics. That is, find some function {s(x)} and accept if \displaystyle  s(x)=s(x^{*}) rather than if \displaystyle x=x^* as before.

If we make this change, then the probability of returning \displaystyle z^* in any one iteration is

\displaystyle  \begin{array}{rcl}  \text{probability of returning }z^{*} & = & p(z^{*})\sum_{x}p(x\vert z^{*})I[s(x)=s(x^{*})].\\ & = & p(z^{*})p(s(x)\vert z^{*})\\ & = & p(s(x))p(z^{*}\vert s(x)). \end{array}

Above we define {p(s'\vert z)=\sum_{x}p(x\vert z)I[s'=s(x)]} and {p(s')=\sum_{x}p(x)I[s(x)=s'].}

The probability of returning anything in a given round is \displaystyle  \begin{array}{rcl}  \sum_{z^{*}}\text{probability of returning }z^{*} & = & p(s(x)). \end{array}

There’s good news and bad news about making this change.

Good news:

  • We have improved the acceptance rate from {p(x)} to {p(s(x)).} This could be much higher if there are many different datasets that yield the same summary statistics.

Bad news:

  • This introduces errors in general. To avoid introducing error, we need that {p(z\vert s(x))=p(z\vert x).}

Exponential family

Often, summary statistics are used even though they introduce errors. It seems to be a bit of a craft to find good summary statistics to speed things up without introducing too much error.

There is one appealing case where no error is introduced. Suppose {p(x|z)} is in the exponential family and {s(x)} are the sufficient statistics for that family. Then, we know that {p(z|x)=p(z\vert s(x))}. This is very nice.

Slop factors

If you’re using a slop factor, you can instead accept according to \displaystyle  d(s(x),s(x^{*}))\leq\epsilon.
This introduces the same kind of computation / accuracy tradeoff.
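Putting the last two ideas together, the accept step compares summary statistics within a tolerance. Here’s a sketch for a made-up Gaussian-mean model, where the sample mean is a natural summary statistic (the prior, tolerance, and dataset are all illustrative choices):

import numpy as np

def abc_summary(x_obs, eps, rng=np.random.default_rng(0)):
    "ABC with summary statistic s(x) = mean(x) and slop factor eps."
    s_obs = np.mean(x_obs)                      # s(x) for the observed data
    while True:
        z = rng.normal(0, 10)                   # z ~ p(z), a broad prior on the unknown mean
        x = rng.normal(z, 1, size=len(x_obs))   # x ~ p(x|z)
        if abs(np.mean(x) - s_obs) <= eps:      # accept if d(s(x*), s(x)) <= eps
            return z

x_obs = np.random.default_rng(1).normal(3.0, 1, size=50)
samples = [abc_summary(x_obs, eps=0.1) for _ in range(200)]
print(np.mean(samples))   # roughly 3; the error shrinks as eps goes to 0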

ABC-MCMC (Or likelihood-free MCMC)

Before getting to ABC-MCMC, suppose we just wanted, for some reason, to use Metropolis-Hastings to sample from the prior {p(z)}. If our proposal distribution is q(\tilde{z}\vert z), then we’d do

Algorithm: (Regular old Metropolis-Hastings)

  • Initialize {z} somehow
  • Repeat:
    • Propose {\tilde{z}\sim q(\tilde{z}\vert z)} from some proposal distribution
    • Compute acceptance probability \displaystyle {\alpha\leftarrow\frac{p(\tilde{z})}{p(z)}\frac{q(z\vert\tilde{z})}{q(\tilde{z}\vert z)}}.
    • Generate {r\sim U(0,1).}
    • If {r\leq\alpha} then {z\leftarrow\tilde{z}}.

Now suppose we want to sample from the posterior {p(z|x)} instead. We suggest the following modified algorithm.

Algorithm: (ABC-MCMC)

  • Initialize {z} somehow and initialize {x^{*}=x.}
  • Repeat:
    • Propose {\tilde{z}\sim q(\tilde{z}\vert z)}.
    • Sample a synthetic dataset {\tilde{x}^{*}\sim p(\tilde{x}^{*}\vert\tilde{z})}.
    • Compute acceptance probability {{\displaystyle \alpha\leftarrow\frac{p(\tilde{z})}{p(z)}\frac{q(z\vert\tilde{z})}{q(\tilde{z}\vert z)}}} {{\displaystyle I[\tilde{x}^{*}=x^{*}]}}.
    • Generate {r\sim U(0,1).}
    • If {r\leq\alpha} then {z\leftarrow\tilde{z}}.

There is only one difference: After proposing {\tilde{z}}, we generate a synthetic dataset. We can accept only if the synthetic dataset is the same as the observed one.
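A sketch of this loop, reusing the coin model from the basic ABC sketch above (uniform prior, binomial data, and a Gaussian random-walk proposal on z, all illustrative choices; with a uniform prior and a symmetric proposal, the acceptance probability reduces to the indicator):

import numpy as np

def abc_mcmc(x_obs, n, iters=50_000, step=0.1, rng=np.random.default_rng(0)):
    z = 0.5                                         # initialize z somehow
    samples = []
    for _ in range(iters):
        z_prop = z + step * rng.standard_normal()   # propose z~ ~ q(z~|z)
        if 0 < z_prop < 1:                          # prior ratio is zero outside the support
            x_synth = rng.binomial(n, z_prop)       # synthetic dataset x~* ~ p(.|z~)
            if x_synth == x_obs:                    # alpha = I[x~* = x] here
                z = z_prop
        samples.append(z)                           # on rejection, the old z is kept
    return np.array(samples)

print(abc_mcmc(x_obs=7, n=20).mean())   # again close to (7+1)/(20+2), ignoring burn-in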

What this solves

There are two computational problems that the original ABC algorithm can encounter:

  • The prior {p(z)} may be a terrible proposal distribution for the posterior {p(z|x)}. Maybe random samples from the prior almost never yield datasets similar to the observed.
  • Even with a good proposal {z}, the acceptance probability {p(x)} might be very low.

The ABC-MCMC algorithm seems intended to deal with the first problem: If the proposal distribution {q(\tilde{z}\vert z)} only yields nearby points, then once the typical set has been reached, the probability of proposing a “good” {\tilde{z}} is much higher.

On the other hand, the ABC-MCMC algorithm seems to do little to address the second problem.

Justification

Now, why is this a correct algorithm? Consider the augmented distribution \displaystyle  p(z,x^{*},x)=p(z,x^{*})I[x^{*}=x].

We now want to sample from {p(z,x^{*}\vert x)} using Metropolis-Hastings. We choose the proposal distribution

\displaystyle  q(\tilde{z},\tilde{x}^{*}\vert z,x^{*})=q(\tilde{z}\vert z)p(\tilde{x}^{*}\vert\tilde{z}).

The acceptance probability will then be

\displaystyle  \begin{array}{rcl}  \alpha & = & \min\left(1,\frac{p(\tilde{z},\tilde{x}^{*},x)}{p(z,x^{*},x)}\frac{q(z,x^{*}\vert\tilde{z},\tilde{x}^{*})}{q(\tilde{z},\tilde{x}^{*}\vert z,x^{*})}\right)\\ & = & \min\left(1,\frac{p(\tilde{z},\tilde{x}^{*})I[\tilde{x}^{*}=x]}{p(z,x^{*})I[x^{*}=x]}\frac{q(z\vert\tilde{z})p(x^{*}\vert z)}{q(\tilde{z}\vert z)p(\tilde{x}^{*}\vert\tilde{z})}\right) \end{array} .

Since the original state {(z,x^{*})} was accepted, it must be true that {x^{*}=x.} Thus, the above can be simplified into

\displaystyle  \begin{array}{rcl}  \alpha & = & \min\left(1,\frac{p(\tilde{z})}{p(z)}\frac{q(z\vert\tilde{z})}{q(\tilde{z}\vert z)}\right)I[\tilde{x}^{*}=x]. \end{array}

Generalizations

If using summary statistics, you change {I[\tilde{x}^{*}=x]} into {I[s(\tilde{x}^{*})=s(x)].} You can also add a slop factor if you want.

More generally still, we could instead use the augmented distribution

\displaystyle  p(z,x^{*},x)=p(z,x^{*})p(x\vert x^{*}).

The proposal {p(x\vert x^{*})} can be something interesting like a Multivariate Gaussian. The acceptance probability then instead becomes

\displaystyle  \begin{array}{rcl}  \alpha & = & \min\left(1,\frac{p(\tilde{z})p(x\vert\tilde{x}^{*})}{p(z)p(x\vert x^{*})}\frac{q(z\vert\tilde{z})}{q(\tilde{z}\vert z)}\right). \end{array}

Of course, this introduces some error.

Choosing {\epsilon}

In practice, a value of {\epsilon} that’s good at the end will lead to very slow progress at the beginning, so it’s best to slowly reduce {\epsilon} over time. Shooting for an acceptance rate of around 1% at the end seems to be a good compromise; a higher acceptance rate would mean that too much error was introduced.

(Thanks to Javier Burroni for helpful comments.)

Three opinions on theorems

1. Think of theorem statements like an API.

Some people feel intimidated by the prospect of putting a “theorem” into their papers. They feel that their results aren’t “deep” enough to justify this. Instead, they give the derivation and result inline as part of the normal text.

Sometimes that’s best. However, the purpose of a theorem is not to shout to the world that you’ve discovered something incredible. Rather, theorems are best thought of as an “API for ideas”. There are two basic purposes:

  • To separate what you are claiming from your argument for that claim.
  • To provide modularity to make it easier to understand or re-use your ideas.

To decide if you should create a theorem, ask if these goals will be advanced by doing so.

A thoughtful API makes software easier to use: The goal is to allow the user as much functionality as possible with as simple an interface as possible, while isolating implementation details. If you have a long chain of mathematical argument, you should choose which parts to write as theorems/lemmas in much the same way.

Many papers intermingle definitions, assumptions, proof arguments, and the final results. Have pity on the reader, and tell them in a single place what you are claiming, and under what assumptions. The “proof” section separates your evidence for your claim from the claim itself. Most readers want to understand your result before looking at the proof. Let them. Don’t make them hunt to figure out what your final result is.

Perhaps controversially, I suggest you should use the above reasoning even if you aren’t being fully mathematically rigorous. It’s still kinder to the reader to state your assumptions informally.

As an example of why it’s helpful to explicitly state your results, here’s an example from a seminal paper, so I’m sure the authors don’t mind. (Click on the image for a larger excerpt.)

proof

This proof is well written. The problem is the many small uncertainties that accumulate as you read it. If you try to understand exactly:

  • What result is being stated?
  • Under what assumptions does that result hold?

You will find that the proof “bleeds in” to the result itself. The convergence rate in 2.13 involves Q(\theta) defined in 2.10, which itself involves other assumptions tracing backwards in the paper.

Of course, not every single claim needs to be written as a theorem/lemma/claim. If your result is simple to state and will only be used in a “one-off” manner, it may be clearer just to leave it in the text. That’s analogous to “inlining” a small function instead of creating another one.

2. Don’t fear the giant equation block.

I sometimes see a proof like this (for 0<x<1)

Take the quantity

\frac{x-x^2}{\sqrt{x^2-2x+1}}.

Pulling out x this becomes

x \frac{1-x}{\sqrt{x^2-2x+1}}.

Factoring the denominator, this is

x \frac{1-x}{\sqrt{(x-1)(x-1)}}.

Etc.

For some proofs, the text between each line just isn’t that helpful. To a large degree it makes things more confusing– without an equality between the lines, you need to read the words to understand how each formula is supposed to be related to the previous one. Consider this alternative version of the proof:

\begin{aligned} \frac{x-x^2}{\sqrt{x^2-2x+1}}  = & x \frac{1-x}{\sqrt{x^2-2x+1}} \\ = & x \frac{1-x}{\sqrt{(x-1)(x-1)}} \\ = & x \frac{1-x}{\sqrt{(x-1)^2}} \\ = & x \frac{1-x}{1-x} \\ = & x \\ \end{aligned}

In some cases, this reveals the overall structure of the proof better than a bunch of lines with interspersed text. If explanation is needed, it can be better to put it at the end, e.g. as “where line 2 follows from [blah] and line 3 follows from [blah]”.

It can also be helpful to put these explanations inline, i.e. to use a proof like

\begin{aligned} \frac{x-x^2}{\sqrt{x^2-2x+1}} = & x \frac{1-x}{\sqrt{x^2-2x+1}} & \text{ pull out x} \\ = & x \frac{1-x}{\sqrt{(x-1)(x-1)}} & \text{ factoring denominator} \\ = & x \frac{1-x}{\sqrt{(x-1)^2}} & \\ = & x \frac{1-x}{1-x} & \text{ since denominator is positive} \\ = & x & \\ \end{aligned}

Again, this is not the best solution for all (or even most) cases, but I think it should be used more often than it is.

3. Use equivalence of inequalities when possible.

Many proofs involve manipulating chains of inequalities. When doing so, it should be obvious at what steps extra looseness may have been introduced. Suppose you have some positive constants a and c with a^2>c and you need to choose b so as to ensure that b^2+c \leq a^2.

People will often prove a result like the following:

Lemma: If b \leq \sqrt{a^2-c}, then b^2+c \leq a^2.

Proof: Under the stated condition, we have that

\begin{aligned} b^2 + c & \leq & (\sqrt{a^2-c})^2+c \\ & = & a^2-c+c \\ & = & a^2 \end{aligned}

That’s all correct, but doesn’t something feel slightly “magical” about the proof?

There are two problems: First, the proof reveals nothing about how you came up with the final answer. Second, the result leaves it ambiguous whether you have introduced additional looseness. Given the starting assumption, is it possible to prove a stronger bound?

I think the following lemma and proof are much better:

Lemma: b^2+c \leq a^2 if and only if b \leq \sqrt{a^2-c}.

Proof: The following conditions are all equivalent:

\begin{aligned} b^2+c & \leq & a^2 \\ b^2 & \leq & a^2-c \\ b & \leq & \sqrt{a^2-c}. \\ \end{aligned}

The proof shows exactly how you arrived at the final result, and shows that there is no extra looseness. It’s better not to “pull a rabbit out of a hat” in a proof if not necessary.

This is arguably one of the most basic possible proof techniques, but is bizarrely underused. I think there’s two reasons why:

  1. Whatever need motivated the lemma is probably met by the first one above. The benefit of the second is mostly in providing more insight.
  2. Mathematical notation doesn’t encourage it. The sentence at the beginning of the proof is essential. If you see this merely as a series of inequalities, each implied by the one before, then it will not give the “only if” part of the result. You could conceivably try to write something like a < b \Leftrightarrow \exp a < \exp b, but this is awkward with multiple lines.

Exponential Families Cheatsheet

As part of the graphical models course I taught last spring, I developed a “cheatsheet” for exponential families. The basic purpose is to explain the standard moment-matching condition of maximum likelihood. The goal of the sheet was to clearly show how this property generalized to maximum likelihood in conditional exponential families, with hidden variables, or both. It’s available as an image below, or as a PDF here. Please let me know about any errors!

exponential_family_cheatsheet4

 

I use the (surprisingly controversial) convention of using a sans-serif font for random variables, rather than capital letters. I’m convinced this is the least-bad option for the machine learning literature, where many readers seem to find capital letter random variables distracting. It also allows you to distinguish matrix-valued random variables, though that isn’t used here.

Statistics – the rules of the game

What is statistics about, really? It’s easy to go through a class and get the impression that it’s about manipulating intimidating formulas. But what’s the goal of them? Why did people invent them?

If you zoom out, the big picture is more conceptual than mathematical. Statistics has a crazy, grasping ambition: it wants to tell you how to best use observations to make decisions. For example, you might look at how much it rained each day in the last week, and decide if you should bring an umbrella today. Statistics converts data into ideal actions.

Here, I’ll try to explain this view. I think it’s possible to be quite precise about this while using almost no statistics background and extremely minimal math.

The two important characters that we meet are decision rules and loss functions. Informally, a decision rule is just some procedure that looks at a dataset and makes a choice. A loss function — a basic concept from decision theory– is a precise description of “how bad” a given choice is.

Model Problem: Coinflips

Let’s say you’re confronted with a coin where the odds of heads and tails are not known ahead of time. Still, you are allowed to observe how the coin performs over a number of flips. After that, you’ll need to make a “decision” about the coin. Explicitly:

  • You’ve got a coin, which comes up heads with probability w. You don’t know w.
  • You flip the coin n times.
  • You see k heads and n-k tails.
  • You do something, depending on k. (We’ll come back to this.)

Simple enough, right? Remember, k is the total number of heads after n flips. If you do some math, you can work out a formula for p(k\vert w,n): the probability of seeing exactly k heads. For our purposes, it doesn’t really matter what that formula is, just that it exists. It’s known as a Binomial distribution, and so is sometimes written \mathrm{Binomial}(k\vert n,w).
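(If you want to play along, scipy has this distribution built in; a quick illustrative check:)

from scipy.stats import binom

# Probability of seeing exactly k=10 heads in n=21 flips of a coin with w=0.5
print(binom.pmf(10, n=21, p=0.5))   # about 0.168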

Here’s an example of what this looks like with n=21 and w=0.5.

bernoulli_21_0.5

Naturally enough, if w=.5, with 21 flips, you tend to see around 10-11 heads. Here’s an example with w=0.2. Here, the most common value is 4, close to 21\times.2=4.2.

bernoulli_21_0.2

Decisions, decisions

After observing some coin flips, what do we do next? You can imagine facing various possible situations, but we will use the following:

Our situation: After observing n coin flips, you need to guess “heads” or “tails”, for one final coin flip.

Here, you just need to “decide” what the next flip will be. You could face many other decisions, e.g. guessing the true value of w.

Now, suppose that you have a friend who seems very skilled at predicting the final coinflip. What information would you need to reproduce your friend’s skill? All you need to know is if your friend predicts heads or tails for each possible value of k. We think of this as a decision rule, which we write abstractly as

\mathrm{Dec}(k).

This is just a function of one integer k. You can think of this as just a list of what guess to make, for each possible observation, for example:

k \mathrm{Dec}(k)
0 \mathrm{tails}
1 \mathrm{heads}
2 \mathrm{heads}
\vdots \vdots
n \mathrm{tails}

One simple decision rule would be to just predict heads if you saw more heads than tails, i.e. to use

\mathrm{Dec}(k)=\begin{cases} \mathrm{heads}, & k\geq n/2 \\ \mathrm{tails} & k<n/2 \end{cases}.

The goal of statistics is to find the best decision rule, or at least a good one. The rule above is intuitive, but not necessarily the best. And… wait a second… what does it even mean for one decision rule to be “better” than another?

Our goal: minimize the thing that’s designed to be minimized

What happens after you make a prediction? Consider our running example. There are many possibilities, but here are two of the simplest:

  • Loss A: If you predicted wrong, you lose a dollar. If you predicted correctly, nothing happens.
  • Loss B: If you predict “tails” and “heads” comes up, you lose 10 dollars. If you predict “heads” and “tails” comes up, you lose 1 dollar. If you predict correctly, nothing happens.

We abstract these through a concept of a loss function. We write this as

L(w,d).

The first input is the true (unknown) value w, while second input is the “prediction” you made. We want the loss to be small.

Now, one point might be confusing. We defined our situation as predicting the next coinflip, but now L is defined comparing d to w, not to a new coinflip. We do this because comparing to w gives the most generality. To deal with our situation, just use the average amount of money you’d lose if the true value of the coin were w. Take loss A. If you predict “tails”, you’ll be wrong with probability w, while if you predict “heads”, you’ll be wrong with probability 1-w, and so lose 1-w dollars on average. This leads to the loss

L_{A}(w,d)=\begin{cases} w & d=\mathrm{tails}\\ 1-w & d=\mathrm{heads} \end{cases}.

For loss B, the situation is slightly different, in that you lose 10 times as much in the first case. Thus, the loss is

L_{B}(w,d)=\begin{cases} 10w & d=\mathrm{tails}\\ 1-w & d=\mathrm{heads} \end{cases}.

The definition of a loss function might feel circular– we minimize the loss because we defined the loss as the thing that we want to minimize. What’s going on? Well, a statistical problem has two separate parts: a model of the data generating process, and a loss function describing your goals. Neither of these things is determined by the other.

So, the loss function is part of the problem. Statistics wants to give you what you want. But you need to tell statistics what that is.

Despite the name, a “loss” can be negative– you still just want to minimize it. Machine learning, always optimistic, favors “reward” functions that are to be maximized. Plus ça change.

Model + Loss = Risk

OK! So, we’ve got a model of our data generating process, and we specified some loss function. For a given w, we know the distribution over k, so… I guess… we want to minimize it?

Let’s define the risk to be the average loss that a decision rule gives for a particular value of w. That is,

R(w,\mathrm{Dec})=\sum_{k}p(k\vert w,n)L(w,\mathrm{Dec}(k)).

Here, the second input to R is a decision rule– a precise recipe of what decision to make in each possible situation.

Let’s visualize this. As a set of possible decision rules, I will just consider rules that predict “heads” if they’ve seen at least m heads, and “tails” otherwise:

\mathrm{Dec}_{m}(k)=\begin{cases} \mathrm{heads} & k\geq m\\ \mathrm{tails} & k<m \end{cases}.

With n=21 there are 23 such decision rules, corresponding to m=0 (always predict heads), m=1 (predict heads if you see at least one heads), up to m=22 (always predict tails). These are shown here:

dec_points
These rules are intuitive: if you’d predict heads after observing 16 heads out of 21, it would be odd to predict tails after seeing 17 instead! It’s true that for losses L_{A} and L_{B}, you don’t lose anything by restricting to this kind of decision rule. However, there are losses for which these decision rules are not enough. (Imagine you lose more when your guess is correct.)
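To make the risk formula concrete, here’s a small sketch of my own (using scipy’s binomial pmf) that evaluates R(w, Dec_m) for the threshold rules under loss L_A; sweeping it over w and m gives the kind of curves plotted below.

import numpy as np
from scipy.stats import binom

def risk_A(w, m, n=21):
    "R(w, Dec_m) under loss L_A: the expected loss of the threshold rule Dec_m."
    k = np.arange(n + 1)
    p_k = binom.pmf(k, n, w)
    # Dec_m predicts heads when k >= m; the loss is w for "tails" and 1-w for "heads"
    loss = np.where(k >= m, 1 - w, w)
    return np.sum(p_k * loss)

print(risk_A(w=0.2, m=11))   # risk of the majority-vote rule when w=0.2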

With those decision rules in place, we can visualize what risk looks like. Here, I fix w=0.2, and I sweep through all the decision rules (by changing m) with loss L_{A}:

onerisk_0.2_A

The value R_A in the bottom plot is the total area of the green bars in the middle. You can do the same sweep for w=0.4, which is pictured here:

onerisk_0.2_A

We can visualize the risk in one figure with various w and m. Notice that the curves for w=0.2 and w=0.4 are exactly the same as we saw above.

all_risk_L_A.png

Of course, we get a different risk depending on what loss function we use. If we repeat the whole process using loss L_{B} we get the following:

all_risk_L_B.png

Dealing with risk

What’s the point of risk? It tells us how good a decision rule is. We want a decision rule where risk is as low as possible. So you might ask, why not just choose the decision rule that minimizes R(w,\mathrm{Dec})?

The answer is: because we don’t know w! How do we deal with that? Believe it or not, there isn’t a single well-agreed upon “right” thing to do, and so we meet two different schools of thought.

Option 1 : All probability all the time

Bayesian statistics (don’t ask about the name) defines a “prior” distribution p(w) over w. This says which values of w we think are more and less likely. Then, we define the Bayesian risk as the average of R over the prior:

R_{\mathrm{Bayes}}(\mathrm{Dec})=\int_{w=0}^{1}p(w)R(w,\mathrm{Dec})dw.

This just amounts to “averaging” over all the risk curves, weighted by how “probable” we think w is. Here’s the Bayes risk corresponding to L_{A} with a uniform prior p(w)=1:

bayes_risk_A

For reference, the risk curves R(w,\mathrm{Dec}_m) are shown in light grey. Naturally enough, for each value of m, the Bayes risk is just the average of the regular risks for each w.

Here’s the risk corresponding to L_{B}:

bayes_risk_B

That’s all quite natural. But we haven’t really searched through all the decision rules, only the simple ones \mathrm{Dec}_m. For other losses, these simple ones might not be enough, and there are a lot of decision rules. (Even for this toy problem there are 2^{22}, since you can output heads or tails for each of k=0, k=1, …, k=21.)

Fortunately, we can get a formula for the best decision rule for any loss. First, re-write the Bayes risk as

R_{\mathrm{Bayes}}(\mathrm{Dec})=\sum_{k} \left( \int_{w=0}^{1}p(w)p(k\vert n,w)L(w,\mathrm{Dec}(k))dw \right).

This is a sum over k where each term only depends on a single value \mathrm{Dec}(k). So, we just need to make the best decision for each individual value of k separately. This leads to the Bayes-optimal decision rule of

\mathrm{Dec}_{\text{Bayes}}(k)=\arg\min_{d}\int_{w=0}^{1}p(w)p(k\vert w,n)L(w,d)dw.
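Numerically, that integral over w is easy to approximate on a grid for this toy problem. A sketch (uniform prior; the grid size, helper names, and loss encodings are my own illustrative choices):

import numpy as np
from scipy.stats import binom

def bayes_rule(loss, n=21, grid=1000):
    "Approximate Dec_Bayes(k) for each k by integrating over w on a grid, with p(w)=1."
    w = np.linspace(0, 1, grid)
    rule = []
    for k in range(n + 1):
        weight = binom.pmf(k, n, w)              # p(w) p(k|n,w) with a uniform prior
        expected = {d: np.mean(weight * loss(w, d)) for d in ("heads", "tails")}
        rule.append(min(expected, key=expected.get))
    return rule

loss_A = lambda w, d: w if d == "tails" else 1 - w
loss_B = lambda w, d: 10 * w if d == "tails" else 1 - w
print(bayes_rule(loss_A))   # predicts tails for small k and heads for large k
print(bayes_rule(loss_B))   # far more reluctant to predict tails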

With a uniform prior p(w)=1, here’s the optimal Bayesian decision rules with loss L_{A}:

d_bayes_A.png

And here it is for loss L_B:

d_bayes_B.png

Look at that! Just mechanically plugging the loss function into the Bayes-optimal decision rule naturally gives us the behavior we expected– for L_{B}, the rule is very hesitant to predict tails, since the loss is so high if you’re wrong. (Again, these happen to fit in the parameterized family \mathrm{Dec}_{m} defined above, but we didn’t use this assumption in deriving the rules.)

The nice thing about the Bayesian approach is that it’s so systematic. No creativity or cleverness is required. If you specify the data generating process (p(k\vert w,n)) the loss function (L(w,d)) and the prior distribution (p(w)) then the optimal Bayesian decision rule is determined.

There are some disadvantages as well:

  • Firstly, you need to make up the prior, and if you do a terrible job, you’ll get a poor decision rule. If you have little prior knowledge, this can feel incredibly arbitrary. (Imagine you’re trying to estimate Big G.) Different people can have different priors, and then get different results.
  • Actually computing the decision rule requires doing an integral over w, which can be tricky in practice.
  • Even if your prior is good, the decision rule is only optimal when averaged over the prior. Suppose, for every day for the next 10,000 years, a random coin is created with w drawn from p(w). Then, no decision rule will incur less loss than \mathrm{Dec}_{\text{Bayes}}. However, on any particular day, some other decision rule could certainly be better.

So, if you have little idea of your prior, and/or you’re only making a single decision, you might not find much comfort in the Bayesian guarantee.

Some argue that these aren’t really disadvantages. Prediction is impossible without some assumptions, and priors are upfront and explicit. And no method can be optimal for every single day. If you just can’t handle that the risk isn’t optimal for each individual trial, then… maybe go for a walk or something?

Option 2 : Be pessimistic

Frequentist statistics (Why “frequentist”? Don’t think about it!) often takes a different path. Instead of defining a prior over w, let’s take a worst-case view. Let’s define the worst-case risk as

R_{\mathrm{worst}}(\mathrm{Dec})=\max_{w}R(w,\mathrm{Dec}).

Then, we’d like to choose an estimator to minimize the worst-case risk. We call this a “minimax” estimator since we minimize the max (worst-case) risk.

Let’s visualize this with our running example and L_{A}:

worst_risk_A

As you can see, for each individual decision rule, it searches over the space of parameters w to find the worst case. We can visualize the risk with L_{B} as:

worst_risk_B

What’s the corresponding minimax decision rule? This is a little tricky to deal with– to see why, let’s expand the worst-case risk a bit more:

R_{\mathrm{Worst}}(\mathrm{Dec})=\max_{w}\sum_{k}p(k\vert n,w)L(w,\mathrm{Dec}(k)).

Unfortunately, we can’t interchange the max and the sum, like we did with the integral and the sum for Bayesian decision rules. This makes it more difficult to write down a closed-form solution. At least in this case, we can still find the best decision rule by searching over our simple rules \mathrm{Dec}_m. But be very mindful that this doesn’t work in general!
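For this toy problem, that brute-force search over the threshold rules is easy to sketch (a grid over w stands in for the max; as the text says, restricting to Dec_m only happens to be safe for these particular losses):

import numpy as np
from scipy.stats import binom

def minimax_threshold(loss, n=21, grid=1000):
    "Pick the m whose worst-case risk max_w R(w, Dec_m) is smallest."
    w_grid = np.linspace(0, 1, grid)
    k = np.arange(n + 1)
    best_m, best_worst = None, np.inf
    for m in range(n + 2):
        d = np.where(k >= m, "heads", "tails")    # the rule Dec_m as a table over k
        worst = max(np.sum(binom.pmf(k, n, w) * loss(w, d)) for w in w_grid)
        if worst < best_worst:
            best_m, best_worst = m, worst
    return best_m, best_worst

loss_B = lambda w, d: np.where(d == "tails", 10 * w, 1 - w)
print(minimax_threshold(loss_B))   # a smaller threshold than the Bayes rule picks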

For L_{A} we end up with the same decision rule as when minimizing Bayesian risk:

d_minimax_A.png

For L_{B}, meanwhile, we get something slightly different:

d_minimax_B.png

This is even more conservative than the Bayesian decision rule. \mathrm{Dec}_{B-\mathrm{Bayes}}(1)=\mathrm{tails}, while \mathrm{Dec}_{B-\mathrm{minimax}}(1)=\mathrm{heads}. That is, the Bayesian method predicts heads when it observes 2 or more, while the minimax rule predicts heads if it observes even one. This makes sense intuitively: The minimax decision rule proceeds as if the “worst” w (a small number) is fixed, whereas the Bayesian decision rule less pessimistically averages over all w.

Which decision rule will work better? Well, if w happens to be near the worst-case value, the minimax rule will be better. If you repeat the whole experiment many times with w drawn from the prior, the Bayesian decision rule will be.

If you do the experiment at some w far from the worst-case value, or you repeat the experiment many times with w drawn from a distribution different from your prior, then you have no guarantees.

Neither approach is “better” than the other, they just provide different guarantees. You need to choose what guarantee you want. (You can kind of think of this as a “meta” loss.)

So what about all those formulas, then?

For real problems, the data generating process is usually much more complex than a Binomial. The “decision” is usually more complex than predicting a coinflip– the most common decision is making a guess for the value of w. Even calculating R(w,\mathrm{Dec}) for fixed w and \mathrm{Dec} is often computationally hard, since you need to integrate over all possible observations. In general, finding exact Bayes or minimax optimal decision rules is a huge computational challenge, and at least some degree of approximation is required. That’s the game, that’s why statistics is hard. Still, even for complex situations the rules are the same– you win by finding a decision rule with low risk.

A Divergence Bound For Hybrids of MCMC and Variational Inference and …

At ICML I recently published a paper that I somehow decided to title “A Divergence Bound for Hybrids of MCMC and Variational Inference and an Application to Langevin Dynamics and SGVI”. This paper gives one framework for building “hybrid” algorithms between Markov chain Monte Carlo (MCMC) and Variational inference (VI) algorithms. Then, it gives an example for particular algorithms, namely:

  • MCMC ⇔ Stochastic Gradient Langevin Dynamics [1] [2] [3]
  • VI ⇔ Stochastic Gradient VI [4] [5] [6]

Is there a pleasing visualization?

Here’s three different views of the algorithm for a one-dimensional problem, interpolating between VI-like algorithms and MCMC-like algorithms as β goes from 0 (VI) to 1 (MCMC).

1d_example_ver

(Admittedly, this might not make that much sense at this point.)

What is VI?

The goal of “inference” is to be able to evaluate expectations with respect to some “target” distribution p(z). Variational inference (VI) converts this problem into the minimization of the KL-divergence KL(q_w(z) \Vert p(z)) for some simple class of distributions q_w(z) parameterized by w. For example, if p is a mixture of two Gaussians (in red), and q is a single Gaussian (in blue), the VI optimization in one dimension would arrive at the solution below.

samps_lab0.00
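As a toy illustration (my own sketch, not the paper’s code), here is stochastic gradient VI fitting a single Gaussian q_w = N(mu, sigma^2) to a mixture of two Gaussians, using the reparameterization trick to estimate the gradient of the KL divergence. The mixture, step sizes, and iteration counts are all made-up choices.

import numpy as np

# Target p(z): a mixture of two Gaussians, the kind of bimodal target pictured above
means, stds, weights = np.array([-2.0, 2.0]), np.array([0.7, 0.7]), np.array([0.5, 0.5])

def grad_log_p(z):
    "d/dz log p(z) for the mixture target."
    comps = weights * np.exp(-0.5 * ((z - means) / stds) ** 2) / stds
    return np.sum(comps * -(z - means) / stds ** 2) / np.sum(comps)

def sgvi(iters=2000, samples=20, eps=0.05, rng=np.random.default_rng(0)):
    mu, log_sigma = 0.5, 0.0                      # w = (mu, log sigma)
    for _ in range(iters):
        e = rng.standard_normal(samples)
        z = mu + np.exp(log_sigma) * e            # reparameterization: z = mu + sigma * eps
        g = np.array([grad_log_p(zi) for zi in z])
        # Gradients of KL(q||p) = -H(q) - E_q[log p(z)], estimated from the samples
        grad_mu = -np.mean(g)
        grad_log_sigma = -np.mean(g * e * np.exp(log_sigma)) - 1.0
        mu -= eps * grad_mu
        log_sigma -= eps * grad_log_sigma
    return mu, np.exp(log_sigma)

print(sgvi())   # typically settles on one mode: mu near 2, sigma near 0.7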

What is MCMC?

Given that same target distribution p(z), Markov chain Monte Carlo creates a random walk over z. The random walk is carefully constructed so that if you run it a long time, the probability it will end up in any given state is proportional to p(z). You can picture it as follows.

samps1.00
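And the corresponding MCMC endpoint: a bare-bones unadjusted Langevin sampler for the same mixture target (again an illustrative sketch with made-up settings; the mixture and grad_log_p are repeated from the VI sketch so this snippet stands alone):

import numpy as np

means, stds, weights = np.array([-2.0, 2.0]), np.array([0.7, 0.7]), np.array([0.5, 0.5])

def grad_log_p(z):
    comps = weights * np.exp(-0.5 * ((z - means) / stds) ** 2) / stds
    return np.sum(comps * -(z - means) / stds ** 2) / np.sum(comps)

def langevin(iters=20_000, eps=0.05, rng=np.random.default_rng(0)):
    z, chain = 0.0, []
    for _ in range(iters):
        # z <- z + (eps/2) * grad log p(z) + sqrt(eps) * noise
        z = z + 0.5 * eps * grad_log_p(z) + np.sqrt(eps) * rng.standard_normal()
        chain.append(z)
    return np.array(chain)

chain = langevin()
print(chain.mean(), chain.std())   # roughly 0 and a bit over 2 if both modes get visited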

Why would you want to interpolate between them?

In short, VI is only an approximate algorithm, but MCMC can be very slow. In practice, the difference can be enormous: MCMC may require many orders of magnitude more time just to equal the performance of VI. This presents a user with an awkward situation where, if one chooses the best algorithm for each time horizon, there’s a “gap” between when VI converges and when MCMC overtakes it. Informally, you can get performance that looks like this:

time_cartoon.png

Intuitively, it seems like something better should be achievable at those intermediate times.

But they are so different. Is it even possible to combine them?

Very roughly speaking, you can define a random walk over the space of variational distributions. Then, you trade off between how random the walk is and how random the distributions are. You arrive at something like this:

samps0.32.png

Put another way, both VI and MCMC seek “high probability” regions in z, but with different coverage strategies:

  • VI explicitly includes the entropy in its objective
  • MCMC injects randomness into its walk

It is therefore natural to define a random walk over w, where we trade off between “how random” the walk is and “how much” high entropy w are favored.

That’s fine intuition, but can you guarantee anything formally?

Yes! Or, at least, sort of. To define a bit of notation, we start with a fixed variational family q(z|w) and a target distribution p(z). Now, we want to define a distribution q(w) (so we can do a random walk) so that

q(z) = \int_w q(w) q(z|w) \approx p(z).

The natural goal would be to minimize the KL-divergence KL(q(Z) \Vert p(Z))=\int_z q(z) \log(q(z)/p(z)). That’s difficult since q(z) is defined by marginalizing out w– you can’t evaluate it. What you can do is set up two upper-bounds on this quantity.

The first bound is the conditional divergence:

KL(q(Z) \Vert p(Z)) \leq D_0 := \int_w q(w) \int_z q(z|w) \log \frac{q(z|w)}{p(z)}

The second bound is the joint divergence. You need to augment p(z) with some distribution p(w|z) and then you have the bound

KL(q(Z) \Vert p(Z)) \leq D_1 := \int_w q(w) \int_z q(z|w) \log \frac{q(w)q(z|w)}{p(z)p(w|z)}

Since these are both upper-bounds, a convex combination of them will also be. Thus, the goal is to find the distribution q(w) that minimizes D_\beta = (1-\beta)D_0 + \beta D_1, for any \beta in the [0,1] interval.

What distribution optimizes that bound?

First, note that D_\beta depends on the choice of p(w|z). You get a valid upper-bound for any choice, but the tightness changes. The paper uses p(w|z) = r(w) q(z|w) / r_z where r_z = \int_w r(w) q(z|w) is a normalizing constant. Here, you can think of r(w) as something akin to a “base measure”. r_z is restricted to be constant over z. (This isn’t a terrible restriction– it essentially means that if r(w) were a prior for q(z|w), it wouldn’t favor any point.)

Taking that choice of p(w|z) the solution turns out to be:

q^*(w) = \exp( s(w) - A)
s(w) = \log r(w) - \log r_z + E_{q(Z|w)} [\beta^{-1} \log p(Z) + (1-\beta^{-1}) \log q(Z|w)]
A = \log \int_w \exp s(w)

Furthermore, the actual value of the divergence bound at the solution turns out to be just the normalizing constant A up to a factor, i.e.

D^*_\beta = - \beta A.

How would you apply this to real algorithms?

To do anything concrete, you need to look at a specific VI algorithm and a specific MCMC algorithm. The paper uses

  • Langevin dynamics as an MCMC algorithm, which iterates the updates

    z \leftarrow z + \frac{\epsilon}{2} \nabla_z \log p(z) + \sqrt{\epsilon} \eta

    where \eta is noise from a standard Gaussian and \epsilon is a step-size.

  • Stochastic gradient VI which uses the iteration

    w \leftarrow w - \frac{\epsilon}{2} \nabla_w KL(q(Z|w) \Vert p(Z))

To get the novel algorithm in this paper, all that really needs to be done is to apply Langevin dynamics to the distribution q^*(w) derived above. Then, after a re-scaling, this becomes the new hybrid algorithm

w \leftarrow w - \frac{\epsilon}{2} \nabla_w \Big( KL(q(Z|w) \Vert p(Z)) + \beta H(w) - \beta \log r_\beta(w) \Big) + \sqrt{\beta \epsilon} \eta.

Here, H is the entropy of q_w. This clearly becomes the previous VI algorithm in the limit of \beta \rightarrow 0. It also essentially becomes Langevin with \beta \rightarrow 1. That’s because the distribution r_\beta (not yet defined!) will prefer w where q(Z|w) is highly concentrated. Thus, only the mean parameters of w matter, and sampling w becomes equivalent to just sampling z.

I feel like you’re skipping some details here.

Yes. First, the experiments below use a diagonal Gaussian for q(z|w) with w=(\mu, \nu) and \nu_i = \log_{10} \sigma_i. Second, the gradient of the objective involves a KL-divergence. Exactly computing this is intractable, but it can be approximated with standard tricks from stochastic VI, namely data subsampling and the “reparameterization trick”. Third, r_\beta needs to be chosen. The experiments below use the (improper) distribution r_\beta(w) \propto \prod_i \mathcal{N}(\nu_i \vert u_\beta,1) where u_\beta is a universal constant chosen for each \beta to minimize the divergence bound when p(z) is a standard Gaussian. (If you, like me, find this displeasing, see below.)

Are there some more pictures of samples?

Here’s a couple 2-D examples sampling from a “doughnut” distribution and a “three peaks” mixture distribution. Here, the samples are visualized by putting a curve at one standard deviation around the mean. Notice it smoothly becomes more “MCMC-like” as \beta increases.

three_peaks.png

donut.png

What about “real” problems? Can you show samples there?

Sure, but of course in more than 2 dimensions it’s hard to show samples. Here are some results sampling from a logistic regression model on the classic ionosphere dataset. As a comparison, I implemented the same model in STAN and ran it for a very long time to generate “presumed correct” samples. I then projected all samples onto the first two principal components.

ionosphere_samples.png

(Note: technically what’s shown here is a sample z being drawn from each sampled w)

The top row shows the results after 10^4 iterations, the middle row after 10^5, and the bottom row after 10^6. You can roughly see that for small time horizons you are better off using a lower value of \beta, but for longer time horizons you should use a larger value.

Do you get the desired speed/accuracy tradeoff?

Here, you need to compare the error that each value of \beta produces at each time horizon. This is made difficult by the fact that you also need to select a step-size \epsilon, and the best step-size changes depending on the time horizon and \beta. To be as fair as possible, I ran 100 experiments with a range of step-sizes and averaged the performance. Then, for each value of \beta and each time horizon, the results are shown with the best step-size. (Actually, this same procedure was used to generate the previous plots of samples as well.)

logreg_errors.png

The above plot shows the error (measured in MMD) on the y axis against time on the x-axis. Note that both axes are logarithmic. There are often several orders of magnitude of time horizons where an intermediate algorithm performs better than pure VI (β=0) or pure MCMC (β=1).

Is there anything that remains unsatisfying about this approach?

The most unsatisfying thing about this approach is the need to choose p(w|z). This is a bit disturbing, since this is not an object that “exists” in either the pure MCMC or pure VI worlds. On the other hand, there is a strong argument that it needs to exist here. If you carefully observe q^*(w) above, you’ll notice that it depends on the particular parameterization of w. So, e.g. if we “stretched out” part of the space of w this would change the marginal density q(z). That would be truly disturbing, but if p(w|z) is transformed in the opposite way, it would counteract that. So, p(w|z) needs to exist to reflect how we’ve parameterized w.

On the other hand, simply picking a single static distribution r_\beta(w) is pretty simplistic. (Recall, p(w|z) was defined in terms of r(w) above) It would be natural to try to adjust this distribution during inference to tighten the bound D^*_\beta. Using the fact that D^*_\beta=-\beta A you can show that it’s possible to find derivatives of D^*_\beta with respect to the parameters of r online, and thus tighten the bound while the algorithm is running. (I didn’t want to do this in this paper since neither VI nor MCMC do this, and it complicates the interpretation of the experiments.)

Finally, the main question is if this can be extended to other pairs of VI / MCMC algorithms. I actually first derived this algorithm by looking at simple discrete graphical models, e.g. Ising models. There, you can use the algorithms:

  • VI: Use a fully-factorized variational distribution and single-site coordinate ascent updates
  • MCMC: Use single-site Gibbs sampling.

You do in fact get a useful hybrid algorithm in the middle. However, the unfortunate reality is that both of the endpoints are considered pretty bad algorithms, so it’s hard to get too excited about the interpolation.

Finally, do note that there are other ideas out there for combining MCMC and VI. However, these usually fall into the camps of “putting MCMC inside of VI” [7] [8] [9] or “putting VI inside of MCMC” [10], rather than a straight interpolation of the two.

The second and third best features of LyX you aren’t using.

LyX is a WYSIWYG editor for latex files. It’s a little bit clunky to use at first, and isn’t perfect (thank you, open source developers– I’m not ungrateful!) but after becoming familiar with it, it’s probably the single piece of software that has most improved my productivity. I like it so much I use it not just for papers but often as a scratchpad for math. Many times during random discussions I’ve used it to quickly bang out some equations and after seeing how fast it was, others immediately switched to using it.

Anyway, while macros may be the best feature of LyX you aren’t using, I recently discovered another couple of excellent features I wasn’t familiar with after years of use, so I thought I’d publicize them. Specifically, I’ve always hated the process of including explanatory figures in LyX. Exporting plots from an experiment is hard to improve on, but creating explanatory graphics by hand has always been painful.

Before, I thought the options were:

1) Create the graph in an external program. This is fine, of course, but is quite inconvenient when you want to go back and revise it. The external program usually saves its own format, so you have to find the graphic, open it again in that program, revise it, export it to .pdf (or whatever), then open the document in LyX and compile it. Then, when you don’t like the way it looks, you have to repeat the whole process. It works, but it’s not efficient, since you can’t edit the content in place. (Which is the whole point of using a WYSIWYG editor in the first place– remove the need for thinking about anything but content.)

2) Write the graphics directly in LyX in a language like TikZ. This is more “in-place” in that you don’t have external files to find and manipulate. However, I find TikZ quite painful to get right, with many re-compilations necessary. If the TikZ code is inline, each of these requires a full compilation of the document. This is hilariously slow when making something like Beamer slides. Further, this totally violates the whole point of WYSIWYG, since you’re looking at code rather than the output.

There are better ways! I’ve wasted countless hours not being aware of these.

Preview Boxes

First, LyX has a beautiful feature of “preview boxes”. Take the following very simple TikZ code, which just draws a square:


\begin{tikzpicture}
\draw[red] (0,0) -- (0,1);
\draw[green] (0,1) -- (1,1);
\draw[red] (1,1) -- (1,0);
\draw[blue] (1,0) -- (0,0);
\end{tikzpicture}

Typically, I’d include this in LyX files by inserting a raw tex box:

Screen Shot 2017-04-09 at 9.52.44 AM.png

And then putting the TikZ code inside:

Screen Shot 2017-04-09 at 9.54.23 AM.png

This is OK, but has the disadvantages from (2) above. The code can be huge; if I have a lot of graphics, I can’t tell what corresponds to what; and I have to do a (slooooooow) recompile of the whole document to see what it looks like.

However, if you just add a “preview box”:

Screen Shot 2017-04-09 at 9.56.13 AM

You get something that looks like this:

Screen Shot 2017-04-09 at 9.56.33 AM.png

So far, so pointless, right? However, when you deselect, LyX shows the graphic in-place:

Screen Shot 2017-04-09 at 9.57.50 AM

You can then click on it to expand the code. This solves most of the problems: You can see what you are doing at a glance, and you don’t need to recompile the whole document to do it.

Native SVG support

Newer versions of LyX also natively support SVG files. You first have to create the file externally using something like Inkscape, which itself saves directly to the SVG format. Then, you can include it in LyX by doing Insert->File->External Material:

Screen Shot 2017-04-09 at 10.02.31 AM.png

And then selecting the SVG file:

Screen Shot 2017-04-09 at 10.03.14 AM

Again, LyX will show it in-place and (if LyX is configured correctly…) correctly output vector graphics in the final document.

Screen Shot 2017-04-09 at 10.05.15 AM

What’s even better is that LyX can automatically open the file in the external editor for you. If you right click, you can “edit externally”:

Screen Shot 2017-04-09 at 10.06.11 AM

Then the external editor will automatically open the file. You can then save it with a keystroke and go back to LyX. No hunting around for the file, no cycles of exporting to other formats, and you see exactly what the final output will look like at all stages. You can really tell that LyX was created by people using it themselves.

Bonus feature

This one is described well-enough already, but helps a lot in big documents: you can click on a point in a generated .pdf and automatically have LyX sync the editor to the corresponding point in the file.

You deserve better than two-sided finite differences

In calc 101, the derivative is derived as df/dx = \lim_{\epsilon \rightarrow 0} (f(x+\epsilon)-f(x))/\epsilon. So, if you want to estimate a derivative, an easy way to do so would be to just pick some small \epsilon and estimate:

df/dx \approx \frac{1}{\epsilon} (f(x+\epsilon)-f(x))
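In code, that’s a one-liner plus a sweep over step sizes; a sketch (the function, the test point, and the range of \epsilon are all just illustrative choices):

import numpy as np

def one_sided(f, x, eps):
    "Forward-difference estimate of f'(x)."
    return (f(x + eps) - f(x)) / eps

# Sweep over step sizes for f = log at x = 0.1; the true derivative is 1/x = 10
for eps in 10.0 ** -np.arange(1, 15):
    print(f"eps={eps:.0e}  estimate={one_sided(np.log, 0.1, eps):.6f}")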

This can work OK. Let’s look at an example of trying to calculate the derivative of \log (x), using a range of different \epsilon

log_0.1_1sided

What’s happening? Well, the result is true mathematically in the limit that \epsilon is small, so it’s natural to get errors for large \epsilon. However, with very small \epsilon you run into trouble because floating-point arithmetic can only represent finite precision. Let’s try again with a smaller value of x=.001

log_0.001_1sided

That’s somewhat concerning. We can still get a nearly correct value, but we have a limited range of steps that achieve it. This is very well-known in numerical analysis, and a common solution is to use two-sided differences, i.e. to estimate

df/dx \approx \frac{1}{2 \epsilon} (f(x+\epsilon)-f(x-\epsilon)).

If we try that, we indeed get better results:

log_0.001_2sided

What’s happening here? Basically, we are using more information about the function to make a higher-order approximation, so we can mathematically get away with using a larger value of \epsilon. This in turn is helpful to avoid the numerical precision demons.

Great! But let’s go deeper. If we use x = 10^{-5}, we get:

log_1e-05_2sided

Huh. One-sided differences would totally fall apart here, and even the two-sided differences seem to be running into trouble. But never fear, you can use “four-sided differences”!

df/dx \approx \frac{1}{12 \epsilon} (-f(x+2\epsilon)+8 f(x+\epsilon)-8 f(x-\epsilon)+ f(x-2 \epsilon))

Then, we get what you might expect:

log_1e-05_4sided

But what if we go deeper, with x=10^{-7}?

log_1e-07_4sided

Or maybe we should go even deeper, with x = 10^{-9}?

log_1e-09_4sided

Even the four-sided differences have failed us. Now, you might take a lesson here that you shouldn’t be using numerical differences (and I sort of agree). Those of certain temperament, on the other hand, would say instead that what we need is power. OK then, how do six-sided differences sound to you? Sound good?

df/dx \approx \frac{1}{60 \epsilon}(- f(x-3 \epsilon)+9 f(x-2 \epsilon)-45 f(x-\epsilon)+45 f(x+\epsilon)-9 f(x+2\epsilon)+f(x+3 \epsilon))

log_1e-09_6sided

This is still tough, but there is at least some range of epsilon where you can calculate a reasonable derivative. There is, of course, a never-ending sequence of these higher-order derivative approximations. There’s even a calculator, for all your bespoke finite-difference stencil needs.
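If you want to reproduce these experiments, here’s a sketch of the two-, four-, and six-point central-difference stencils from above as code (the test point and step sizes are just illustrative):

import numpy as np

def central_2(f, x, eps):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def central_4(f, x, eps):
    return (-f(x + 2*eps) + 8*f(x + eps) - 8*f(x - eps) + f(x - 2*eps)) / (12 * eps)

def central_6(f, x, eps):
    return (-f(x - 3*eps) + 9*f(x - 2*eps) - 45*f(x - eps)
            + 45*f(x + eps) - 9*f(x + 2*eps) + f(x + 3*eps)) / (60 * eps)

x = 1e-9   # the true derivative of log at x is 1/x = 1e9
for eps in 10.0 ** -np.arange(10, 17):
    print(eps, central_2(np.log, x, eps), central_4(np.log, x, eps), central_6(np.log, x, eps))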

Now, you might think that the real solution here is to use automatic differentiation, and you’re mostly right. It takes more computation to sample the function at more points, and no number of samples will fundamentally stop the numerical demons from destroying your result.

However, there still remain cases where it’s worthwhile to manually compute a derivative. More importantly, perhaps, when you’re implementing an automatic differentiation tool, you still need to test that it is actually correct! Here, numerical differences will probably forever remain useful. So, at least for building a test suite for autodiff code, it certainly makes sense to use these higher-order approximations.

(I originally intended to leave this as a comment on Tim Vieira’s post on testing gradient implementations, but this kinda got out of control.)

Sneaking up on Bayesian Inference (A fable in four acts)

monkey

Act 1: Magical Monkeys

Two monkeys, Alfred ({a}) and Betty ({b}) live in a parallel universe with two kinds of blocks, green ({g}) and yellow ({y}). Alfred likes green blocks, and Betty prefers the yellow blocks. One day, a Wizard decides to give one of the monkeys the magical power to send one block over to our universe each day.

The Wizard chooses the magical monkey by flipping a fair four-sided die. He casts a spell on Alfred if the outcome is 1, and Betty if the outcome is {2-4}. That is, the Wizard chooses Alfred with probability {.25} and Betty with probability {.75}. Both Alfred and Betty send their favorite colored block with probability {.8}

After the Wizard has chosen, we see the first magical block, and it is green. Our problem is: What is the probability that Alfred is the magical monkey?

Intuitively speaking, we have two somewhat contradictory pieces of information. We know that the Wizard chooses Betty more often. But green is Alfred’s preferred color. Given probabilities above, is Alfred or Betty more probable?

First, we can write down all the probabilities. These are

\displaystyle \begin{aligned}Pr(M=a) & =.25\\ Pr(M=b) & =.75\\ Pr(B=g\vert M=a) & =.8\\ Pr(B=y\vert M=a) & =.2\\ Pr(B=g\vert M=b) & =.2\\ Pr(B=y\vert M=b) & =.8. \end{aligned}

Now, the quantity we are interested in is {Pr(M=a\vert B=g).} We can mechanically apply Bayes’ rule to obtain

\displaystyle \begin{aligned}Pr(M=a\vert B=g) & =\frac{1}{Pr(B=g)}Pr(M=a)Pr(B=g\vert M=a)\\ & =\frac{1}{Pr(B=g)}.25\times.8. \end{aligned}

Similarly, we can calculate that

\displaystyle Pr(M=b\vert B=g)=\frac{1}{Pr(B=g)}.75\times.2.

But, of course, we know that one of the monkeys is magical, so these quantities must sum to one. Thus (since they both involve the quantity {Pr(B=g)} that we haven’t explicitly calculated) we can normalize to obtain that

\displaystyle Pr(M=a\vert B=g)=\frac{.25\times.8}{.25\times.8+.75\times.2}=\frac{4}{7}.
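(If you’d rather let the computer do the arithmetic, here is the same normalization as a tiny Python sketch; the variable names are mine.)

# prior times likelihood of a green block, for each monkey
w_alfred = .25 * .8
w_betty = .75 * .2
# normalize so the two probabilities sum to one
p_alfred = w_alfred / (w_alfred + w_betty)   # 4/7, about 0.571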

Act 2: Magical Monkeys on Multiple Days

We wait around two more days, and two more blocks appear, both of which are yellow. Assume that the monkeys make an independent choice each day to send their favorite or less favorite block. Now, what is the probability that Alfred is the magical monkey, given that {B=gyy}?

Intuitively speaking, what do we expect? Betty is more likely to be chosen, and 2/3 of the blocks we’ve seen are suggestive of Betty (since she prefers yellow).

Now, we can mechanically calculate the probabilities. This is just like before, except we use the fact that the blocks seen on each day are conditionally independent given a particular monkey.

\displaystyle \begin{aligned}Pr(M=a\vert B=gyy) & =\frac{1}{Pr(B=gyy)}Pr(M=a)Pr(B=gyy\vert M=a)\\ & =\frac{1}{Pr(B=gyy)}Pr(M=a)Pr(B=g\vert M=a)Pr(B=y\vert M=a)Pr(B=y\vert M=a)\\ & =\frac{1}{Pr(B=gyy)}.25\times.8\times.2\times.2\\ Pr(M=b\vert B=gyy) & =\frac{1}{Pr(B=gyy)}Pr(M=b)Pr(B=g\vert M=b)Pr(B=y\vert M=b)Pr(B=y\vert M=b)\\ & =\frac{1}{Pr(B=gyy)}.75\times.2\times.8\times.8 \end{aligned}

Again, we know that these two probabilities sum to one. Thus, we can normalize and get that

\displaystyle Pr(M=a\vert B=gyy)=\frac{.25\times.8\times.2\times.2}{.25\times.8\times.2\times.2+.75\times.2\times.8\times.8}=\frac{1}{13}.

So now, Betty looks more likely by far.
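In code, the only change from Act 1 is that we multiply in one likelihood per day (again, the variable names are mine):

import numpy as np

prior = {'a': .25, 'b': .75}
# per-block probabilities for each monkey
lik = {'a': {'g': .8, 'y': .2},
       'b': {'g': .2, 'y': .8}}

blocks = ['g', 'y', 'y']
# prior times the product of per-day likelihoods
w = {m: prior[m] * np.prod([lik[m][b] for b in blocks]) for m in prior}
p_alfred = w['a'] / (w['a'] + w['b'])   # 1/13, about 0.077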

Act 3: Magical Monkeys and the Weather

Now, suppose the Wizard gives us some additional information. The magical monkey (whilst relaxing over in the other universe) is able to perceive our weather, which is either clear {(c)} or rainy {(r)}. Both the monkeys prefer clear weather. When the weather is clear, they send their preferred block with probability {.8}, while if the weather is rainy, they angrily send their preferred block with probability {.2}. That is, we have

\displaystyle \begin{aligned}Pr(B=g\vert M=a,W=c) & =.8\\ Pr(B=g\vert M=a,W=r) & =.2\\ Pr(B=g\vert M=b,W=c) & =.2\\ Pr(B=g\vert M=b,W=r) & =.8. \end{aligned}

Along with seeing the previous sequence of blocks {B=gyy}, we observed the weather sequence {W=rrc}. Now, what is the probability that Alfred is the magical monkey?

We ask the Wizard what the distribution over rainy and clear weather is. The wizard haughtily responds that this is irrelevant, but does confirm that the weather is independent of which monkey was made magical.

Can we do anything without knowing the distribution over the weather? Can we calculate the probability that Alfred is the magical monkey?

Let’s give it a try. We can apply Bayes’ equation to get that

\displaystyle \begin{aligned}Pr(M=a\vert B=gyy,W=rrc) & =\frac{1}{Pr(B=gyy,W=rrc)}Pr(M=a)Pr(B=gyy,W=rrc\vert M=a).\end{aligned}

Now, since the weather is independent of the monkey, we know that

\displaystyle \begin{aligned}Pr(B=gyy,W=rrc\vert M=a) & =Pr(W=rrc\vert M=a)Pr(B=gyy\vert M=a,W=rrc)\\ & =Pr(W=rrc)Pr(B=gyy\vert M=a,W=rrc). \end{aligned}

Thus, we have that

\displaystyle \begin{aligned}Pr(M=a\vert B=gyy,W=rrc) =&\frac{Pr(W=rrc)}{Pr(B=gyy,W=rrc)}Pr(M=a)Pr(B=gyy\vert M=a,W=rrc)\\ =&\frac{Pr(W=rrc)}{Pr(B=gyy,W=rrc)}Pr(M=a) \\ & \times Pr(B=g\vert M=a,W=r)Pr(B=y\vert M=a,W=r)Pr(B=y\vert M=a,W=c)\\ =&\frac{Pr(W=rrc)}{Pr(B=gyy,W=rrc)}.25\times.2\times.8\times.8. \end{aligned}

Through the same logic we can calculate that

\displaystyle Pr(M=b\vert B=gyy,W=rrc)=\frac{Pr(W=rrc)}{Pr(B=gyy,W=rrc)}.75\times.8\times.2\times.2.

Again, we know that these probabilities need to sum to one. Since the factor {Pr(W=rrc)/Pr(B=gyy,W=rrc)} is constant between the two, we can normalize it out. Thus, we get that

\displaystyle Pr(M=a\vert B=gyy,W=rrc)=\frac{.25\times.2\times.8\times.8}{.25\times.2\times.8\times.8+.75\times.8\times.2\times.2}=\frac{4}{7}.

Again, it looks like Alfred was the more likely monkey. And– oh wait– we somehow got away with not knowing the distribution over the weather…

Act 4: Predicting the next block

Now, suppose that after seeing the previous sequences (namely {B=gyy} and {W=rrc}), on the fourth day, we find that it has rained. What is the probability that we will get a green block on the fourth day? Mathematically, we want to calculate

\displaystyle Pr(B'=g\vert W'=r,B=gyy,W=rrc).

How can we go about this? Well, if we knew that the magical monkey was Alfred (which we do not!), it would be easy to calculate. We would just have

\displaystyle Pr(B'=g\vert W'=r,M=a),

and similarly if the magical monkey were Betty. Now, we don’t know which monkey is magical, but we know the probabilities that each monkey is magical given the available information– we just calculated them in the previous section! So, we can factorize the distribution of interest as

\displaystyle \begin{aligned}Pr(B'=g\vert W'=r,B=gyy,W=rrc) & =Pr(B'=g\vert W'=r,M=a)Pr(M=a\vert B=gyy,W=rrc)\\ & +Pr(B'=g\vert W'=r,M=b)Pr(M=b\vert B=gyy,W=rrc)\\ & =.2\times\frac{4}{7}+.8\times\frac{3}{7}\\ & =\frac{16}{35} \end{aligned}

So we are slightly less likely than even to see a green block on the next day.
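In code, this prediction step is just a posterior-weighted average of the two monkeys’ predictions (a small sketch; the numbers are the ones derived above):

# posterior over monkeys from Act 3
post = {'a': 4/7, 'b': 3/7}
# probability of a green block on a rainy day, for each monkey
p_green_rain = {'a': .2, 'b': .8}
# posterior-weighted average of the per-monkey predictions
p_green = sum(post[m] * p_green_rain[m] for m in post)   # 16/35, about 0.457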

Epilogue: That’s Bayesian Inference

This is how Bayesian inference works (in the simplest possible setting). In general Bayesian inference, you have:

  • A set of possible inputs (for us these are weather patterns)
  • A set of possible outputs (for us these are colored blocks)
  • A set of possible models, that map a given input to a distribution over the outputs (for us these are monkeys)
  • A “prior” distribution over which models are more likely (for us, this is the wizard and the four-sided die)
  • A dataset of inputs and outputs (for us, this is {W=rrc} and {B=gyy})
  • A new input (for us this is {W'=r})
  • An unknown output that we’d like to predict (for us this is {B'})

Bayesian inference essentially proceeds in two steps:

  • In the first step, one uses the prior over which models are more likely along with the dataset of inputs and outputs to build a “posterior” distribution over which models are more likely. We did this by using

    \displaystyle Pr(M\vert W,B)\propto Pr(M)\prod_{i}Pr(B_{i}\vert W_{i},M),

    where {i} indexes the different days of observation.

  • In the second step, one “integrates over the posterior”. That is, each model gives us a distribution over {B'} given the new input {W'}. So, we simply sum over the different models

    \displaystyle Pr(B'\vert W')=\sum_{m}Pr(M=m\vert W,B)Pr(B'\vert W',M=m).

In the general case, the set of possible models is much larger (typically infinite) and more complex computational methods need to be used to integrate over it. (Commonly Markov chain Monte Carlo or variational methods). Also, of course, we don’t typically have a Wizard telling us exactly what the prior distribution over the models is, meaning one must make assumptions, or try to “learn” the prior as in, say, empirical Bayesian methods.
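For a finite set of models, these two steps are only a few lines of code. Here is a rough sketch (the function and argument names are mine, not from any library), where likelihood(m, w, b) plays the role of Pr(B=b\vert W=w,M=m):

import numpy as np

def posterior_over_models(prior, likelihood, inputs, outputs):
    """Step 1: prior times the product of per-observation likelihoods,
    normalized so the posterior sums to one over models."""
    w = np.array([prior[m] * np.prod([likelihood(m, wi, bi)
                                      for wi, bi in zip(inputs, outputs)])
                  for m in range(len(prior))])
    return w / w.sum()

def posterior_predictive(posterior, likelihood, new_input, possible_outputs):
    """Step 2: average each model's prediction under the posterior."""
    return {b: sum(posterior[m] * likelihood(m, new_input, b)
                   for m in range(len(posterior)))
            for b in possible_outputs}

With a continuous or very large model class, the sum over models becomes the integral that MCMC or variational methods approximate.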

Algorithmic Dimensions

There are many dimensions on which we might compare a machine learning or data mining algorithm. A few of the first that come to mind are:

1) Sample complexity, convergence

How much predictive power is the algorithm able to extract from a given number of examples?

All else being equal, if algorithm A with N examples behaves the same as algorithm B with 2N examples, we would prefer algorithm A. This can vary in importance depending on how scarce or expensive data is.

2) Speed

How quickly does the algorithm run?

Obviously, a tool is more useful if it is faster. However, if that faster algorithm comes at the expense of sample complexity, one would need to measure the expense of running longer against the expense of gathering more data.

3) Guarantees

Does the algorithm just have good performance (along whatever dimension) in practice, or is it proven? Is it always good, or only in certain situations? How predictable is the performance?

When SVMs showed up, it looked like good practical algorithms came from good theory. These days, it seems clear that powerful intuition is also an excellent source of practical algorithms (deep learning).

Perhaps theory will some day catch up, and the algorithm with the best bounds will coincide with the best practical performance. However, this is not always the case today, where we often face a trade-off between algorithms with good theoretical guarantees and algorithms with good empirical performance. Some examples are 1) Upper-Confidence Bounds versus Thompson sampling in bandit algorithms [Update June 2018: Thompson sampling has largely caught up!], 2) running a convex optimization algorithm with a theoretically derived Lipschitz constant versus a smaller one that still seems to work, and 3) doing model selection via VC-dimension generalization bounds versus using K-fold cross-validation.

4) Memory usage

How much space does the algorithm need?

I work with a lot of compute-heavy applications where this is almost a wall: we don’t care about memory usage until we run out of it, after which we care a great deal. With simpler algorithms and larger datasets, this is often more nuanced, with concerns about different cache sizes.

5) Handholding

Can it be used out of the box, or does it require an expert to “tune” it for best performance?

A classic example of this is stochastic gradient descent. In principle, for a wide range of inputs, convergence is guaranteed by iteratively setting \theta_{k+1} = \theta_k - \lambda_k g_k where g_k is a noisy unbiased estimate of the gradient at \theta_k and \lambda_k is some sequence of step-sizes that obeys the Robbins-Monro conditions [1]. However, in practice, the difference between, say, \lambda_k = 1/k, and \lambda_k = 7.2/(522.9+k) can be enormous, and finding these constants is a bit of a dark art.

[1] (\sum_k \lambda_k=\infty and \sum_k \lambda_k^2<\infty.)
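To see the handholding problem in miniature, here is a toy sketch: the same SGD loop run with two step-size schedules, both of which satisfy the conditions in [1]. The quadratic objective and the noise level are made up for illustration; the constants 7.2 and 522.9 are just the ones quoted above.

import numpy as np

def sgd(step_size, n_iters=10000, seed=0):
    """Minimize E[(theta - 1)^2] from noisy gradients:
    theta_{k+1} = theta_k - lambda_k * g_k."""
    rng = np.random.default_rng(seed)
    theta = 0.0
    for k in range(1, n_iters + 1):
        g = 2 * (theta - 1.0) + rng.standard_normal()  # noisy unbiased gradient
        theta = theta - step_size(k) * g
    return theta

theta_a = sgd(lambda k: 1.0 / k)             # lambda_k = 1/k
theta_b = sgd(lambda k: 7.2 / (522.9 + k))   # lambda_k = 7.2/(522.9+k)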

6) Implementability

How simple is the algorithm? How easy is it to implement?

This is quite complex and situational. These days, I’d consider an algorithm that consists of a few moderate-dimensional matrix multiplications or singular value decompositions “simple”. However, that’s due to the huge effort that’s been devoted to designing reliable matrix algorithms, and the ubiquity of easy to use libraries.

7) Amenability to parallelization

Does the algorithm lend itself to parallelization? (And what type of parallelization?)

And if it does, under what model? MapReduce, GPU, MPI, and OpenMP all have different properties.

8) “Anytime-ness”

Can the algorithm be implemented in an anytime manner?

That is, does the algorithm continuously return a current “best-guess” of the answer that is refined over time in a sensible manner? This can help diagnose problems before running the full algorithm, which is enormously useful when debugging large systems.

Note this is distinct from being an online algorithm, which I’m not mentioning here, since it’s a mix of speed and memory properties.

9) Transparency, interpretability

Can the final result be understood? Does it give insight into how predictions are being made?

[Figure: outputs of a decision tree on a two-dimensional dataset]

Galit Shmueli argues that “explanatory power” and “predictive power” are different dimensions. However, there are several ways in which interpretability is important even when prediction is the final goal. Firstly, the insight might convince a decision maker that the machine learning system is reliable. Second, this can be vital in practice for generalization. The world is rarely independent and identically distributed. If a domain expert can understand the predictive mechanism, they may be able to assess whether it will still hold in the future, or whether it only captures something true in the training period. Third, understanding what the predictor is doing also often yields ways to improve it. For example, the adjacent visualization of the outputs of a decision tree in two dimensions suggests the need for non-axis-aligned splits.

10) Generality

What class of problems can the algorithm address?

All else being equal, we prefer an algorithm that can, say, optimize all convex losses over one that can only fit the logistic loss. However, this often comes at a cost: a more general-purpose algorithm cannot exploit the extra structure present in a specialized problem (or, at least, has more difficulty doing so). It’s instructive how many different methods are used by LIBLINEAR depending on the particular loss and regularization.

11) Extendability

Does the algorithm have lots of generalizations and variants that are likely to be interesting?

More of an issue when reviewing a paper than when deciding which algorithm is best to use once all the dust has settled.

12) Insight

Does the algorithm itself (as opposed to its results) convey insight into the problem?

Gradient descent for maximum likelihood learning of an exponential family is a good example, as it reveals the moment-matching conditions. Insight of this type, however, doesn’t suggest that one should actually run the algorithm.

13) Model robustness

How does the performance of the algorithm hold up to violations of its modeling assumptions?

Suppose we are fitting a simple line to some 2-D data. Obviously, all else being equal, we would prefer an algorithm that still does something reasonable when the expected value of y is not actually linear in x. The example I always harp on here is the pseudolikelihood. The original paper pointed out that this will have somewhat worse sample complexity than the full likelihood, and (much!) better time complexity. Many papers seem to attribute the bad performance of the pseudolikelihood in practice to this sample complexity, when the true cause is that the likelihood does something reasonable (minimizes KL divergence) when there is model mis-specification, but the pseudolikelihood does not.

Something I often ponder is how much improvement in one dimension is enough to “matter”. Personally, I would generally consider even a small constant factor (say 5-10%) improvement in sample complexity quite important. Meanwhile, it would be rare to get excited about even, say, a factor of two improvement in running time.

What does this reflect? Firstly, generalization is the fundamental goal of data analysis, and we are likely to be willing to compromise most things if we can really predict better. Second, we instinctively distrust running times. Theory offers few tools for understanding constant factors, as these are highly architecture and implementation dependent. In principle, if one could be completely convincing that a factor of two improvement was truly there, I think this probably would be significant. (This might be true, say, if all algorithms are bottlenecked by a few matrix multiplications, and a new algorithm provably reduces the number needed.) However, this is rare.

In some places, I think constant factor skepticism can lead us astray. In reality, a factor of 30 improvement in speed is probably better than changing an O(N \log N) complexity to O(N). (Calculate N such that \log N=30.) This is particularly true when the lower-complexity algorithm has higher constant factors. As an example, I’ve always found the O(N \log N) algorithm for projection onto the l_1 ball to be faster than the O(N) algorithm in practice.
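For reference, here is a sketch of that O(N \log N) projection, written from memory following the usual sort-and-threshold construction (project onto the simplex, then handle signs for the l_1 ball); treat it as illustrative rather than as a canonical implementation.

import numpy as np

def project_simplex(v, z=1.0):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = z},
    via the O(N log N) sort-and-threshold construction."""
    u = np.sort(v)[::-1]                      # sort descending
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    # largest index k with u_k > (css_k - z) / k
    rho = np.nonzero(u - (css - z) / ks > 0)[0][-1]
    theta = (css[rho] - z) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def project_l1_ball(v, z=1.0):
    """Projection onto the l1 ball of radius z."""
    if np.abs(v).sum() <= z:
        return v
    return np.sign(v) * project_simplex(np.abs(v), z)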

Given that there are so many dimensions, can a new algorithm really improve on all simultaneously? It seems rare to even improve on a single dimension without a corresponding cost somewhere else. There will often be a range of techniques that are Pareto optimal, depending on one’s priorities. Understanding this range is what makes having an expert around so important.

Ideally, as a community, we would be able to provide a “consumer” of machine learning an easy way to find the right algorithm for their problem. Or, at least, we might be able to point them in the direction of possible algorithms. One admirable attempt along this line is the following table from The Elements of Statistical Learning:

[Table: characteristics of learning methods, from Hastie et al., The Elements of Statistical Learning]

(Incidentally, notice how many of the desiderata do not overlap with mine.)

Some other situations show a useful contrast. For example, take this decision tree for choosing an algorithm for unconstrained optimization, due to Dianne O’Leary:

[Figure: O’Leary’s decision tree for choosing an unconstrained optimization algorithm]

Essentially, this amounts to the principle that one should use the least general algorithm available, so that it can exploit as much structure of the problem as possible. Though one can quibble with the details (pity the subgradient methods), it at least comes close to giving the “right” algorithm in each situation.

This doesn’t seem possible with machine learning, since there doesn’t exist a single hierarchy. Rather, ML problems are a tangle of model specification, computational and architecture requirements, implementation constraints, user risk-tolerances, and so on. It won’t be easy to automate away the experts. (Even ignoring the possible misalignment of incentives when a field is asked to automate itself.)

(This post incorporates comments from Aditya Menon and Cheng Soon Ong.)