# The human regression ensemble

I sometimes worry that people credit machine learning with magical powers. Friends from other fields often show me little datasets. Maybe they measured the concentration of a protein in some cell line for the last few days and they want to know what it will be tomorrow.

Day Concentration
Monday 1.32
Tuesday 1.51
Wednesday 1.82
Thursday 2.27
Friday 2.51
Saturday ???

Sure, you can use a fancy algorithm for this, but I usually recommend to just stare hard at the data, use your intuition, and make a guess. My friends respond with horror—you can’t just throw out predictions, that’s illegal! They want to use a rigorous method with guarantees.

Now, it’s true we have methods with guarantees, but those guarantees are often a bit of a mirage. For example, you can do a linear regression and get a confidence interval for the regression coefficients. That’s fine, but you’re assuming (1) the true relationship is linear, (2) the data are independent, (3) the noise is Gaussian, and (4) the magnitude of noise is constant. If those (unverifiable) assumptions aren’t true, your guarantees don’t hold.

But is it really true that humans can do as well as algorithms for simple tasks? Let’s test this.

## What I did

1. I defined four simple one-dimensional regression problems using common datasets. For each of those problems, I split the data into a training set and a test set. Here’s what that looks like for the boston dataset

2. I took the training points, and plotted them to a .pdf file as black dots, with four red dots for registration.

In each .pdf file there were 25 identical copies of the training data like above.

3. I transferred that .pdf file to my tablet. On the tablet, I hand-drew 25 curves that I felt were all plausible fits of the data.

4. I transferred the labeled .pdf back to my computer, and wrote some simple image processing code that would read in all of the lines and average them. I then used this average to predict the test data.

5. As a comparison, I made predictions for the test data using six standard regression methods: Ridge regression (Ridge), local regression (LOWESS), Gaussian processes regression (GPR), random forests (RF), neural networks (MLP) and K-nearest neighbors (K-NN). More details about all these methods are below.

6. To measure error, I computed the root mean squared error (RMSE) and the mean absolute error (MAE).

To make sure the results were fair, I committed myself to just drawing the curves for each dataset once, and never touching them again, even if I did something that seems stupid in retrospect—which as you’ll see below, I did.

On the other hand, I had to do some of tinkering with all the machine learning methods to get reasonable results, e.g. changing how neural networks were optimized, or what hyper-parameters to cross-validate over. This might create some bias, but if it does, it’s in favor of the machine learning methods and against me.

## Results

For the boston dataset, I used the crime variable for the x-axis and house value variable for the y-axis. Here’s all the lines I drew on top of each other:

And here are the results comparing to the machine learning algorithms:

Here are the results for the diabetes dataset. I used age for the x-axis and disease progression for the y-axis. (I don’t think I did a great job drawing curves for this one.)

Here are the results for the iris dataset, using sepal length for the x-axis and petal width for the y-axis.

And finally, here are the results for the wine dataset, using malic acid for the x-axis and alcohol for the y-axis.

I tend to think I under-reacted a bit to the spike of data with with x around 0.2 and large y values. I thought at the time that it didn’t make sense to have a non-monotonic relationship between malic acid and alcohol. However, in retrospect it could easily be real, e.g. because it’s a cluster of one type of wine.

## Summary of results

Here’s a summary of the the RMSE for all datasets.

Method Boston Diabetes Iris Wine
Ridge .178 .227 .189 .211
LOWESS .178 .229 .182 .212
Gaussian Process .177 .226 .184 .204
Random Forests .192 .226 .192 .200
Neural Nets .177 .225 .185 .211
K-NN .178 .232 .186 .202
justin .178 .230 .181 .204

And here’s a summary of the MAE

Method Boston Diabetes Iris Wine
Ridge .133 .191 .150 .180
LOWESS .134 .194 .136 .180
Gaussian Process .131 .190 .139 .170
Random Forests .136 .190 .139 .162
Neural Nets .131 .190 .139 .179
K-NN .129 .196 .137 .165
justin .121 .194 .138 .171

Honestly, I’m a little surprised how well I did here—I expected that I’d do OK but that some algorithm (probably LOWESS, still inexplicably not in base scikit-learn) would win in most cases.

I’ve been doing machine learning for years, but I’ve never run a "human regression ensemble" before. With practice, I’m sure I’d get better at drawing these lines, but I’m not going to get any better at applying machine learning methods.

I didn’t do anything particularly clever in setting up these machine learning methods, but it wasn’t entirely trivial (see below). A random person in the world is probably more likely than I was to make a mistake when running a machine learning method, but just as good at drawing curves. This is an extremely robust way to predict.

What’s the point of this? It’s just that machine learning isn’t magic. For simple problems, it doesn’t fundamentally give you anything better than you can get just from common sense.

Machine learning is still useful, of course. For one thing, it can be automated. (Drawing many curves is tedious…) And with much larger datasets, machine learning will—I assume—beat any manual predictions. The point is just that in those cases it’s an elaboration on common sense, not some magical pixie dust.

## Details on the regression methods

Here were the machine learning algorithms I used:

1. Ridge: Linear regression with squared l2-norm regularization.
2. LOWESS: Locally-weighted regression.
3. GPR: Gaussian-process regression with an RBF kernel
4. RF: Random forests
5. MLP: A single hidden-layer neural network / multi-layer perceptron with tanh nonlinearities, optimized by (non-stochastic) l-bfgs with 50,000 iterations.
6. KNN: K-nearest neighbors

For all the methods other than gaussian processes, I used 5-fold cross-validation to tune the key hyper- parameter. The options I used were

1. Ridge: Regularization penalty of λ=.001, .01, .1, 1, or 10.
2. LOWESS: Bandwidth of σ=.001,.01,.1,1,10
3. Random forests: Minimum samples in each leaf of n=1,2,…,19
4. Multi-layer perceptrons: Used 1, 5, 10, 20, 50, or 100 hidden units, with α=.01 regularization.
5. K-nearest neighbors: used K=1,2,…,19 neighbors.

For Gaussian processes, I did not use cross-validation, but rather scikit-learn’s built-in hyperparameter optimization. In particular, I used the magical incantation kernel = ConstantKernel(1.0,(.1,10)) + ConstantKernel(1.0,(.1,10)) * RBF(10,(.1,100)) + WhiteKernel(5,(.5,50)) which I understand means the system optimizes the kernel parameters to maximize the marginal likelihood.

# Statistics – the rules of the game

What is statistics about, really? It’s easy to go through a class and get the impression that it’s about manipulating intimidating formulas. But what’s the goal of them? Why did people invent them?

If you zoom out, the big picture is more conceptual than mathematical. Statistics has a crazy, grasping ambition: it wants to tell you how to best use observations to make decisions. For example, you might look at how much it rained each day in the last week, and decide if you should bring an umbrella today. Statistics converts data into ideal actions.

Here, I’ll try to explain this view. I think it’s possible to be quite precise about this while using almost no statistics background and extremely minimal math.

The two important characters that we meet are decision rules and loss functions. Informally, a decision rule is just some procedure that looks at a dataset and makes a choice. A loss function — a basic concept from decision theory– is a precise description of “how bad” a given choice is.

## Model Problem: Coinflips

Let’s say you’re confronted with a coin where the odds of heads and tails are not known ahead of time. Still, you are allowed to observe how the coin performs over a number of flips. After that, you’ll need to make a “decision” about the coin. Explicitly:

• You’ve got a coin, which comes up heads with probability $w$. You don’t know $w$.
• You flip the coin $n$ times.
• You see $k$ heads and $n-k$ tails.
• You do something, depending on $k$. (We’ll come back to this.)

Simple enough, right? Remember, $k$ is the total number of heads after $n$ flips. If you do some math, you can work out a formula for $p(k\vert w,n)$: the probability of seeing exactly $k$ heads. For our purposes, it doesn’t really matter what that formula is, just that it exists. It’s known as a Binomial distribution, and so is sometimes written $\mathrm{Binomial}(k\vert n,w)$.

Here’s an example of what this looks like with $n=21$ and $w=0.5$.

Naturally enough, if $w=.5$, with $21$ flips, you tend to see around $10-11$ heads. Here’s an example with $w=0.2$. Here, the most common value is $4$, close to $21\times.2=4.2$.

## Decisions, decisions

After observing some coin flips, what do we do next? You can imagine facing various possible situations, but we will use the following:

Our situation: After observing n coin flips, you need to guess “heads” or “tails”, for one final coin flip.

Here, you just need to “decide” what the next flip will be. You could face many other decisions, e.g. guessing the true value of w.

Now, suppose that you have a friend who seems very skilled at predicting the final coinflip. What information would you need to reproduce your friend’s skill? All you need to know is if your friend predicts heads or tails for each possible value of k. We think of this as a decision rule, which we write abstractly as

$\mathrm{Dec}(k).$

This is just a function of one integer $k$. You can think of this as just a list of what guess to make, for each possible observation, for example:

 $k$ $\mathrm{Dec}(k)$ $0$ $\mathrm{tails}$ $1$ $\mathrm{heads}$ $2$ $\mathrm{heads}$ $\vdots$ $\vdots$ $n$ $\mathrm{tails}$

One simple decision rule would be to just predict heads if you saw more heads than tails, i.e. to use

$\mathrm{Dec}(k)=\begin{cases} \mathrm{heads}, & k\geq n/2 \\ \mathrm{tails} & k

The goal of statistics is to find the best decision rule, or at least a good one. The rule above is intuitive, but not necessarily the best. And… wait a second… what does it even mean for one decision rule to be “better” than another?

## Our goal: minimize the thing that’s designed to be minimized

What happens after you make a prediction? Consider our running example. There are many possibilities, but here are two of the simplest:

• Loss A: If you predicted wrong, you lose a dollar. If you predicted correctly, nothing happens.
• Loss B: If you predict “tails” and “heads” comes up, you lose 10 dollars. If you predict “heads” and “tails” comes up, you lose 1 dollar. If you predict correctly, nothing happens.

We abstract these through a concept of a loss function. We write this as

$L(w,d)$.

The first input is the true (unknown) value $w$, while second input is the “prediction” you made. We want the loss to be small.

Now, one point might be confusing. We defined our situation as predicting the next coinflip, but now $L$ is defined comparing $d$ to $w$, not to a new coinflip. We do this because comparing to $w$ gives the most generality. To deal with our situation, just use the average amount of money you’d lose if the true value of the coin were $w$. Take loss A. If you predict “tails”, you’ll be wrong with probability $w$, while if you predict “heads”, you’ll be wrong with probability $1-w$, and so lose $1-w$ dollars on average. This leads to the loss

$L_{A}(w,d)=\begin{cases} w & d=\mathrm{tails}\\ 1-w & d=\mathrm{heads} \end{cases}.$

For loss B, the situation is slightly different, in that you lose 10 times as much in the first case. Thus, the loss is

$L_{B}(w,d)=\begin{cases} 10w & d=\mathrm{tails}\\ 1-w & d=\mathrm{heads} \end{cases}.$

The definition of a loss function might feel circular– we minimize the loss because we defined the loss as the thing that we want to minimize. What’s going on? Well, a statistical problem has two separate parts: a model of the data generating process, and a loss function describing your goals. Neither of these things is determined by the other.

So, the loss function is part of the problem. Statistics wants to give you what you want. But you need to tell statistics what that is.

Despite the name, a “loss” can be negative– you still just want to minimize it. Machine learning, always optimistic, favors “reward” functions that are to be maximized. Plus ça change.

## Model + Loss = Risk

OK! So, we’ve got a model of our data generating process, and we specified some loss function. For a given w, we know the distribution over k, so… I guess… we want to minimize it?

Let’s define the risk to be the average loss that a decision rule gives for a particular value of w. That is,

$R(w,\mathrm{Dec})=\sum_{k}p(k\vert w,n)L(w,\mathrm{Dec}(k)).$

Here, the second input to $R$ is a decision rule– a precise recipe of what decision to make in each possible situation.

Let’s visualize this. As a set of possible decision rules, I will just consider rules that predict “heads” if they’ve seen at least m heads, and “tails” otherwise:

$\mathrm{Dec}_{m}(k)=\begin{cases} \mathrm{heads} & k\geq m\\ \mathrm{tails} & k

With $n=21$ there are $22$ such decision rules, corresponding to $m=0$, (always predict heads), $m=1$ (predict heads if you see at least one heads), up to $m=22$ (always predict tails). These are shown here:

These rules are intuitive: if you’d predict heads after observing 16 heads out of 21, it would be odd to predict tails after seeing 17 instead! It’s true that for losses $L_{A}$ and $L_{B}$, you don’t lose anything by restricting to this kind of decision rule. However, there are losses for which these decision rules are not enough. (Imagine you lose more when your guess is correct.)

With those decision rules in place, we can visualize what risk looks like. Here, I fix $w=0.2$, and I sweep through all the decision rules (by changing $m$) with loss $L_{A}$:

The value $R_A$ in the bottom plot is the total area of the green bars in the middle. You can do the same sweep for $w=0.4$, which you can is pictured here:

We can visualize the risk in one figure with various $w$ and $m$. Notice that the curves for $w=0.2$ and $w=0.4$ are exactly the same as we saw above.

Of course, we get a different risk depending on what loss function we use. If we repeat the whole process using loss $L_{B}$ we get the following:

## Dealing with risk

What’s the point of risk? It tells us how good a decision rule is. We want a decision rule where risk is as low as possible. So you might ask, why not just choose the decision rule that minimizes $R(w,\mathrm{Dec})$?

The answer is: because we don’t know $w$! How do we deal with that? Believe it or not, there isn’t a single well-agreed upon “right” thing to do, and so we meet two different schools of thought.

### Option 1 : All probability all the time

Bayesian statistics (don’t ask about the name) defines a “prior” distribution $p(w)$ over $w$. This says which values of $w$ we think are more and less likely. Then, we define the Bayesian risk as the average of $R$ over the prior:

$R_{\mathrm{Bayes}}(\mathrm{Dec})=\int_{w=0}^{1}p(w)R(w,\mathrm{Dec})dw.$

This just amounts to “averaging” over all the risk curves, weighted by how “probable” we think $w$ is. Here’s the Bayes risk corresponding to $L_{A}$ with a uniform prior $p(w)=1$:

For reference, the risk curves $R(w,\mathrm{Dec}_m)$ are shown in light grey. Naturally enough, for each value of $m$, the Bayes risk is just the average of the regular risks for each $w$.

Here’s the risk corresponding to $L_{B}$:

That’s all quite natural. But we haven’t really searched through all the decision rules, only the simple ones $\mathrm{Dec}_m$. For other losses, these simple ones might not be enough, and there are a lot of decision rules. (Even for this toy problem there are $2^{22}$, since you can output heads or tails for each of $k=0$, $k=1$, …, $k=21$.)

Fortunately, we can get a formula for the best decision rule for any loss. First, re-write the Bayes risk as

$R_{\mathrm{Bayes}}(\mathrm{Dec})=\sum_{k} \left( \int_{w=0}^{1}p(w)p(k\vert n,w)L(w,\mathrm{Dec}(k))dw \right).$

This is a sum over $k$ where each term only depends on a single value $\mathrm{Dec}(k)$. So, we just need to make the best decision for each individual value of $k$ separately. This leads to the Bayes-optimal decision rule of

$\mathrm{Dec}_{\text{Bayes}}(k)=\arg\min_{d}\int_{w=0}^{1}p(w)p(k\vert w,n)L(w,d)dw.$

With a uniform prior $p(w)=1$, here’s the optimal Bayesian decision rules with loss $L_{A}$:

And here it is for loss $L_B$:

Look at that! Just mechanically plugging the loss function into the Bayes-optimal decision rule naturally gives us the behavior we expected– for $L_{B}$, the rule is very hesitant to predict tails, since the loss is so high if you’re wrong. (Again, these happen to fit in the parameterized family $\mathrm{Dec}_{m}$ defined above, but we didn’t use this assumption in deriving the rules.)

The nice thing about the Bayesian approach is that it’s so systematic. No creativity or cleverness is required. If you specify the data generating process ($p(k\vert w,n)$) the loss function ($L(w,d)$) and the prior distribution ($p(w$)) then the optimal Bayesian decision rule is determined.

There are some disadvantages as well:

• Firstly, you need to make up the prior, and if you do a terrible job, you’ll get a poor decision rule. If you have little prior knowledge, this can feel incredibly arbitrary. (Imagine you’re trying to estimate Big G.) Different people can have different priors, and then get different results.
• Actually computing the decision rule requires doing an integral over w, which can be tricky in practice.
• Even if your prior is good, the decision rule is only optimal when averaged over the prior. Suppose, for every day for the next 10,000 years, a random coin is created with $w$ drawn from $p(w)$. Then, no decision rule will incur less loss than $\mathrm{Dec}_{\text{Bayes}}$. However, on any particular day, some other decision rule could certainly be better.

So, if you have little idea of your prior, and/or you’re only making a single decision, you might not find much comfort in the Bayesian guarantee.

Some argue that these aren’t really disadvantages. Prediction is impossible without some assumptions, and priors are upfront and explicit. And no method can be optimal for every single day. If you just can’t handle that the risk isn’t optimal for each individual trial, then… maybe go for a walk or something?

### Option 2 : Be pessimistic

Frequentist statistics (Why “frequentist”? Don’t think about it!) often takes a different path. Instead of defining a prior over w, let’s take a worst-case view. Let’s define the worst-case risk as

$R_{\mathrm{worst}}(\mathrm{Dec})=\max_{w}R(w,\mathrm{Dec}).$

Then, we’d like to choose an estimator to minimize the worst-case risk. We call this a “minimax” estimator since we minimize the max (worst-case) risk.

Let’s visualize this with our running example and $L_{A}$:

As you can see, for each individual decision rule, it searches over the space of parameters $w$ to find the worst case. We can visualize the risk with $L_{B}$ as:

What’s the corresponding minimax decision rule? This is a little tricky to deal with– to see why, let’s expand the worst-case risk a bit more:

$R_{\mathrm{Worst}}(\mathrm{Dec})=\max_{w}\sum_{k}p(k\vert n,w)L(w,\mathrm{Dec}(k)).$

Unfortunately, we can’t interchange the max and the sum, like we did with the integral and the sum for Bayesian decision rules. This makes it more difficult to write down a closed-form solution. At least in this case, we can still find the best decision rule by searching over our simple rules $\mathrm{Dec}_m$. But be very mindful that this doesn’t work in general!

For $L_{A}$ we end up with the same decision rule as when minimizing Bayesian risk:

For $L_{B}$, meanwhile, we get something slightly different:

This is even more conservative than the Bayesian decision rule. $\mathrm{Dec}_{B-\mathrm{Bayes}}(2)=\mathrm{tails}$, while $\mathrm{Dec}_{B-\mathrm{minimax}}(2)=\mathrm{heads}$. That is, the Bayesian method predicts heads when it observes 2 or more, while the minimax rule predicts heads if it observes even one. This makes sense intuitively: The minimax decision rule proceeds as if the “worst” w (a small number) is fixed, whereas the Bayesian decision rule less pessimistically averages over all w.

Which decision rule will work better? Well, if w happens to be near the worst-case value, the minimax rule will be better. If you repeat the whole experiment many times with w drawn from the prior, the Bayesian decision rule will be.

If you do the experiment at some w far from the worst-case value, or you repeat the experiment many times with w drawn from a distribution different from your prior, then you have no guarantees.

Neither approach is “better” than the other, they just provide different guarantees. You need to choose what guarantee you want. (You can kind of think of this as a “meta” loss.)

## So what about all those formulas, then?

For real problems, the data generating process is usually much more complex than a Binomial. The “decision” is usually more complex than predicting a coinflip– the most common decision is making a guess for the value of $w$. Even calculating $R(w,\mathrm{Dec})$ for fixed $w$ and $\mathrm{Dec}$ is often computationally hard, since you need to integrate over all possible observations. In general, finding exact Bayes or minimax optimal decision rules is a huge computational challenge, and at least some degree of approximation is required. That’s the game, that’s why statistics is hard. Still, even for complex situations the rules are the same– you win by finding a decision rule with low risk.

# You deserve better than two-sided finite differences

In calc 101, the derivative is derived as $df/dx = \lim_{\epsilon \rightarrow 0} (f(x+\epsilon)-f(x))/\epsilon$. So, if you want to estimate a derivative, an easy way to do so would be to just pick some small $\epsilon$ and estimate:

$df/dx \approx \frac{1}{\epsilon} (f(x+\epsilon)-f(x))$

This can work OK. Let’s look at an example of trying to calculate the derivative of $\log (x)$, using a range of different $\epsilon$

What’s happening? Well, the result is true mathematically in the limit that $\epsilon$ is small, so it’s natural to get errors for large $\epsilon$. However, with very small $\epsilon$ you run into trouble because floating-point arithmetic can only represent finite precision. Let’s try again with a smaller value of $x=.001$

That’s somewhat concerning. We can still get a nearly correct value, but we have a limited range of steps that achieve it. This is very well-known in numerical analysis, and a common solution is to use two-sided differences, i.e. to estimate

$df/dx \approx \frac{1}{2 \epsilon} (f(x+\epsilon)-f(x-\epsilon)).$

If we try that, we indeed get better results:

What’s happening here? Basically, we are using more information about the function to make a higher-order approximation, so we can mathematically get away with using a larger value of $\epsilon$. This in turn isis helpful to avoid the numerical precision demons.

Great! But let’s go deeper. If we use $x = 10^{-5}$, we get:

Huh. 1-sided differences totally fall apart, but we seem to be running into more trouble. But never fear, you can use “four-sided differences”!

$df/dx \approx \frac{1}{12 \epsilon} (-f(x+2\epsilon)+8 f(x+\epsilon)-8 f(x-\epsilon)+ f(x-2 \epsilon))$

Then, we get what you might expect:

But what if we go deeper, with $x=10^{-7}$?

Or maybe we should go even deeper, with $x = 10^{-9}$?

Even the four-sided differences have failed us. Now, you might take a lesson here that you shouldn’t be using numerical differences (and I sort of agree). Those of certain temperament, on the other hand, would say instead that what we need is power. OK then, how do six-sided differences sound to you? Sound good?

$df/dx \approx \frac{1}{60 \epsilon}(- f(x-3 \epsilon)+9 f(x-2 \epsilon)-45 f(x-\epsilon)+45 f(x+\epsilon)-9 f(x+2\epsilon)+f(x+3 \epsilon))$

This is still tough, but there is at least some range of epsilon where you can calculate a reasonable derivative. There is, of course, a never-ended sequence of these higher-order derivative approximations. There’s even a calculator, for all your bespoke finite-difference stencil needs.

Now, you might think that the real solution here is to use automatic differentiation, and you’re mostly right. It takes more computation to sample the function at more points, and no number of samples will fundamentally stop the numerical demons from destroying your result.

However, there still remain cases where it’s worthwhile to manually compute a derivative. More importantly perhaps, when you’re implementing an automatic differentiation tool, you still need to test that it is actually correct! Here, numerically differences will probably forever remain useful. So, certainly for the cases of building a test suite for autodiff code, it certainly makes sense to use these higher-order derivatives.

(I originally intended to leave this as a comment on Tim Viera’s post on testing gradient implementations, but this kinda got out of control.)

# Sneaking up on Bayesian Inference (A fable in four acts)

Act 1: Magical Monkeys

Two monkeys, Alfred (${a}$) and Betty (${b}$) live in a parallel universe with two kinds of blocks, green (${g}$) and yellow (${y}$). Alfred likes green blocks, and Betty prefers the yellow blocks. One day, a Wizard decides to give one of the monkeys the magical power to send one block over to our universe each day.

The Wizard chooses the magical monkey by flipping a fair four-sided die. He casts a spell on Alfred if the outcome is 1, and Betty if the outcome is ${2-4}$. That is, the Wizard chooses Alfred with probability ${.25}$ and Betty with probability ${.75}$. Both Alfred and Betty send their favorite colored block with probability ${.8}$

After the Wizard has chosen, we see the first magical block, and it is green. Our problem is: What is the probability that Alfred is the magical monkey?

Intuitively speaking, we have two somewhat contradictory pieces of information. We know that the Wizard chooses Betty more often. But green is Alfred’s preferred color. Given probabilities above, is Alfred or Betty more probable?

First, we can write down all the probabilities. These are

\displaystyle \begin{aligned}Pr(M=a) & =.25\\ Pr(M=b) & =.75\\ Pr(B=g\vert M=a) & =.8\\ Pr(B=y\vert M=a) & =.2\\ Pr(B=g\vert M=b) & =.2\\ Pr(B=y\vert M=b) & =.8. \end{aligned}

Now, the quantity we are interested in is ${Pr(M=a\vert B=g).}$ We can mechanically apply Bayes’ rule to obtain

\displaystyle \begin{aligned}Pr(M=a\vert B=g) & =\frac{1}{Pr(B=g)}Pr(M=a)Pr(B=g\vert M=a)\\ & =\frac{1}{Pr(B=g)}.25\times.8. \end{aligned}

Similarly, we can calculate that

$\displaystyle Pr(M=b\vert B=g)=\frac{1}{Pr(B=g)}.75\times.2.$

But, of course, we know that one of the monkeys is magical, so these quantities must sum to one. Thus (since they both involve the quantity ${Pr(B=g)}$ that we haven’t explicitly calculated) we can normalize to obtain that

$\displaystyle Pr(M=a\vert B=g)=\frac{.25\times.8}{.25\times.8+.75\times.2}=\frac{4}{7}.$

Act 2: Magical Monkeys on Multiple Days

We wait around two more days, and two more blocks appear, both of which are yellow. Assume that the monkeys make an independent choice each day to send their favorite or less favorite block. Now, what is the probability that Alfred is the magical monkey, given that ${B=gyy}$?

Intuitively speaking, what do we expect? Betty is more likely to be chosen, and 2/3 of the blocks we’ve seen are suggestive of Betty (since she prefers yellow).

Now, we can mechanically calculate the probabilities. This is just like before, except we use the fact that the blocks seen on each day are conditionally independent given a particular monkey.

\displaystyle \begin{aligned}Pr(M=a\vert B=gyy) & =\frac{1}{Pr(B=gyy)}Pr(M=a)Pr(B=gyy\vert M=a)\\ & =\frac{1}{Pr(B=gyy)}Pr(M=a)Pr(B=g\vert M=a)Pr(B=y\vert M=a)Pr(B=y\vert M=a)\\ & =\frac{1}{Pr(B=gyy)}.25\times.8\times.2\times.2\\ Pr(M=b\vert B=gyy) & =\frac{1}{Pr(B=gyy)}Pr(M=b)Pr(B=g\vert M=b)Pr(B=y\vert M=b)Pr(B=y\vert M=b)\\ & =\frac{1}{Pr(B=gyy)}.75\times.2\times.8\times.8 \end{aligned}

Again, we know that these two probabilities sum to one. Thus, we can normalize and get that

$\displaystyle Pr(M=a\vert B=gyy)=\frac{.25\times.8\times.2\times.2}{.25\times.8\times.2\times.2+.75\times.2\times.8\times.8}=\frac{1}{13}.$

So now, Betty looks more likely by far.

Act 3: Magical Monkeys and the Weather

Now, suppose the Wizard gives us some additional information. The magical monkey (whilst relaxing over in the other universe) is able to perceive our weather, which is either clear ${(c)}$ or rainy ${(r)}$. Both the monkeys prefer clear weather. When the weather is clear, they send their preferred block with probability ${.8}$, while if the weather is rainy, they angrily send their preferred block with probability ${.2.}$ That is, we have

\displaystyle \begin{aligned}Pr(B=g\vert M=a,W=c) & =.8\\ Pr(B=g\vert M=a,W=r) & =.2\\ Pr(B=g\vert M=b,W=c) & =.2\\ Pr(B=g\vert M=b,W=r) & =.8. \end{aligned}

Along with seeing the previous sequence of blocks ${B=gyy}$, we observed the weather sequence ${W=rrc}$. Now, what is the probability that Alfred is the magical monkey?

We ask the Wizard what the distribution over rainy and clear weather is. The wizard haughtily responds that this is irrelevant, but does confirm that the weather is independent of which monkey was made magical.

Can we do anything without knowing the distribution over the weather? Can we calculate the probability that Alfred is the magical monkey?

Let’s give it a try. We can apply Bayes’ equation to get that

\displaystyle \begin{aligned}Pr(M=a\vert B=gyy,W=rrc) & =\frac{1}{Pr(B=gyy,W=rrc)}Pr(M=a)Pr(B=gyy,W=rrc\vert M=a).\end{aligned}

Now, since the weather is independent of the monkey, we know that

\displaystyle \begin{aligned}Pr(B=gyy,W=rrc\vert M=a) & =Pr(W=rrc\vert M=a)Pr(B=gyy\vert M=a,W=rrc)\\ & =Pr(W=rrc)Pr(B=gyy\vert M=a,W=rrc). \end{aligned}

Thus, we have that

\displaystyle \begin{aligned}Pr(M=a\vert B=gyy,W=rrc) =&\frac{Pr(W=rrc)}{Pr(B=gyy,W=rrc)}Pr(M=a)Pr(B=gyy\vert M=a,W=rrc)\\ =&\frac{Pr(W=rrc)}{Pr(B=gyy,W=rrc)}Pr(M=a) \\ & \times Pr(B=g\vert M=a,W=r)Pr(B=y\vert M=a,W=r)Pr(B=y\vert M=a,W=c)\\ =&\frac{Pr(W=rrc)}{Pr(B=gyy,W=rrc)}.25\times.2\times.8\times.8. \end{aligned}

Through the same logic we can calculate that

$\displaystyle Pr(M=b\vert B=gyy,W=rrc)=\frac{Pr(W=rrc)}{Pr(B=gyy,W=rrc)}.75\times.8\times.2\times.2.$

Again, we know that these probabilities need to sum to one. Since the factor ${Pr(W=rrc)/Pr(B=gyy,W=rrc)}$ is constant between the two, we can normalize it out. Thus, we get that

$\displaystyle Pr(M=a\vert B=gyy,W=rrc)=\frac{.25\times.2\times.8\times.8.}{.25\times.2\times.8\times.8.+.75\times.8\times.2\times.2.}=\frac{4}{7}.$

Again, it looks like Alfred was the more likely monkey. And– oh wait– we somehow got away with not knowing the distribution over the weather…

Act 4: Predicting the next block

Now, suppose that after seeing the previous sequences (namely ${B=gyy}$ and ${W=rrc}$) , on the fourth day, we find that it has rained. What is the probability that we will get a green block on the fourth day? Mathematically, we want to calculate

$\displaystyle Pr(B'=g\vert W'=r,B=ggy,W=rrc).$

How can we go about this? Well, if we knew that the magical monkey was Alfred (which we do not!), it would be easy to calculate. We would just have

$\displaystyle Pr(B'=g\vert W'=r,M=a),$

and similarly if the magical monkey were Betty. Now, we don’t know which monkey is magical, but we know the probabilities that each monkeys is magical given the available information– we just calculated them in the previous section! So, we can factorize the distribution of interest as

\displaystyle \begin{aligned}Pr(B'=g\vert W'=r,B=ggy,W=rrc) & =Pr(B'=g\vert W'=r,M=a)Pr(M=a\vert B=ggy,W=rrc)\\ & +Pr(B'=g\vert W'=r,M=b)Pr(M=b\vert B=ggy,W=rrc)\\ & =.2\times\frac{4}{7}+.8\times\frac{3}{7}\\ & =\frac{33}{70} \end{aligned}

So we are slightly less likely than even to see a green block on the next day.

Epilogue: That’s Bayesian Inference

This is how Bayesian inference works (in the simplest possible setting). In general Bayesian inference, you have:

• A set of possible inputs (for us these are weather patterns)
• A set of possible outputs (for us these are colored blocks)
• A set of possible models, that map a given input to a distribution over the outputs (for us these are monkeys)
• A “prior” distribution over which models are more likely (for us, this is the wizard and the four-sided die)
• A dataset of inputs and outputs (for us, this is ${W=rrc}$ and ${B=ggy}$)
• A new input (for us this is ${W'=r}$)
• An unknown output that we’d like to predict (for us this is ${B'}$)

Bayesian inference essentially proceeds in two steps:

• In the first step, one uses the prior over which models are more likely along with the dataset of inputs and outputs to build a “posterior” distribution over which models are more likely. We did this by using

$\displaystyle Pr(M\vert W,B)\propto Pr(M)\prod_{i}Pr(B_{i}\vert W_{i},M),$

where ${i}$ indexes the different days of observation.

• In the second step, one “integrates over the posterior”. That is, each model gives us a distribution over ${B'}$ given the new input ${W'}$. So, we simply sum over the different models

$\displaystyle Pr(B'\vert W')=\sum_{m}Pr(M=m\vert W,B)Pr(B'\vert W',M=m).$

In the general case, the set of possible models is much larger (typically infinite) and more complex computational methods need to be used to integrate over it. (Commonly Markov chain Monte Carlo or variational methods). Also, of course, we don’t typically have a Wizard telling us exactly what the prior distribution over the models is, meaning one must make assumptions, or try to “learn” the prior as in, say, empirical Bayesian methods.

# Favorite things NIPS

I always enjoy reading conference reports, so I thought I’d mention a few papers that caught my eye.  (I welcome any corrections to my summaries of any of these.)

1. Recent Progress in the Structure of Large-Treewidth Graphs and Some Applications, by Chandra Chekuri. Not a paper, but rather an exceptionally clear tutorial on material that should probably be mandatory for anyone interested in computational complexity and graphical models. The talk isn’t online yet, but slides are available.

2. Learning Distributed Representations for Structured Output Prediction, by Vivek Srikumar and Christopher Manning. The idea is that when doing structured prediction, labels are often “similar” in some sense, and so should be encouraged to have similar weights. This is done by defining weights not over a discrete label space, but over a finite-dimensional distributed representation, with one vector for each label. Learning alternates between updating the representation of each label and regular structured learning.

3. Clamping Variables and Approximate Inference, by Adrian Weller and Tony Jebara. The paper provides a first-principles based proof that the Bethe approximation gives a lower-bound on the true partition function for attractive binary pairwise models. (A result originally proved by Ruozzi in 2012) The basic idea is that “clamping” (i.e. brute-force summing over all values of a variable, and using the Bethe approximation on the resulting model) can only increase the Bethe approximation. But, if you clamp everything, you have the exact partition function, so Bethe must be a lower-bound.  The proofs provide a lot of insight into why the results are true.

4. Making Pairwise Binary Graphical Models Attractive, by Nicholas Ruozzi and Tony Jebara. This builds on similar ideas to the above paper. There is an idea of a cover of a graph, where vertices have multiple copies, but neighborhoods look locally the same. The paper gives an appropriate cover construction such that the cover is attractive (up to a “flipping” of the variables). Thus, the Bethe partition function of the cover lower-bounds the true partition function of the cover. This is shown to be a lower-bound on the tree-reweighted partition function. As far as I can tell, Bethe on the cover may upper or lower-bound the true partition function, so Bethe-on-cover isn’t necessarily better than tree-reweighted.  Still, since tree-reweighted is typically a loose upper-bound, it is likely to be better in practice.

5. Distributed Bayesian Posterior Sampling via Moment Sharing, by Minjie Xu, Balaji Lakshminarayanan, Yee Whye Teh, Jun Zhu, and Bo Zhang. You want to do Bayesian MCMC inference using a cluster. This paper partitions the data over the nodes of the cluster, does local MCMC approximations, and uses an exponential-family approximation and expectation propagation to send information about the posterior between the nodes, thus approximating full/normal Bayesian inference. A couple days ago Andrew Gelman mentioned a similar method, considering other local approximations.

6. Mode Estimation for High Dimensional Discrete Tree Graphical Models, by Chao Chen, Han Liu, Dimitris Metaxas, and Tianqi Zhao. The goal is to find the M highest modes in a tree-structured graphical model. They find a set of necessary local conditions to be a mode, and then propose a junction-tree type algorithm operating with local mode configurations as values to find global modes.

7. Efficient Inference of Continuous Markov Random
Fields with Polynomial Potentials
, by Shenlong Wang, Alexander Schwing, and Raquel Urtasun. The goal is to find a local minimum of a polynomial MRF by use of the CCCP method where one iteratively splits the objective into convex and concave parts and linearizes the concave part to get an upper-bound. I’d have expected this to be easy, but it turns out to be quite difficult to split up a polynomial into convex and concave parts. This paper gives an SDP problem to find a function to add to the objective to make it convex, yielding the desired split.

8. Hardness of parameter estimation in graphical models, by Guy Bresler, David Gamarnik, and Devavrat Shah. In principle, this is the ultimate non-surprising result: given the mean parameters (data marginals), it is computationally intractable to find the natural parameters (MRF weights) that produce these marginals. Formally given an oracle to do maximum likelihood learning in the normal way (find parameters to match a given mean vector), you could use that oracle to approximate the partition function, which is known to be hard. The really surprising thing here is that this wasn’t proved until 2014!

9. A* Sampling, by Chris Maddison, Daniel Tarlow, and Tom Minka. A method for exact sampling for low-dimensional continuous distributions, based on fundamentally different principles from existing methods. Intuitively, the algorithm simultaneously generates Gumbel-type noise for every possible (real-valued) configuration and optimizes the resulting function by branch-and-bound.

10. The workshops, as usual, were great aside from the usual annoyance of it only being possible to be in one place at one time.

11. Montreal.  Charming, friendly.

# Truncated Bi-Level Optimization

In 2012, I wrote a paper that I probably should have called “truncated bi-level optimization”.  I vaguely remembered telling the reviewers I would release some code, so I’m finally getting around to it.

The idea of bilevel optimization is quite simple.  Imagine that you would like to minimize some function $L(w)$.  However, $L$ itself is defined through some optimization.  More formally, suppose we would like to solve

$\min_{w,y} Q(y), \text{ s.t. } y = \arg\min_y E(y,w).$

Or, equivalently,

$\min_{w} L(w)=Q(y^*(w)),$

where $y^*$ is defined as $y^*(w) := \arg\min_y E(y,w)$.  This seems a little bit obscure at first, but actually comes up in several different ways in machine learning and related fields.

## Hyper-parameter learning

The first example would be in learning hyperparameters, such as regularization constants.  Inevitably in machine learning, one fits parameters parameters to optimize some tradeoff between the quality of a fit to training data and a regularization function being small.  Traditionally, the regularization constant is selected by optimizing on a training dataset with a variety of values, and then picking the one that performs best on a held-out dataset.  However, if there are a large number of regularization parameters, a high-dimensional grid-search will not be practical.  In the notation above, suppose that $w$ is a vector of regularization constants, and that $y$ are training parameters.  Let, $E$  be the regularized empirical risk on a training dataset, and let $Q$ be how well the parameters $y^*(w)$ perform on some held-out validation dataset.

## Energy-based models

Another example (and the one suggesting the notation) is an energy-based model.  Suppose that we have some “energy” function $E_x(y,w)$ which measures how well an output $y$ fits to an input $x$.  The energy is parametrized by $w$.  For a given training input/output pair $(\hat{x},\hat{y})$, we might have that $Q_{\hat{y}}(y^*(w))$ measures how how the predicted output $y^*$ compares to the true output $\hat{y}$, where $y^*(w)=\arg\max_y E_{\hat{x}}(y,w)$.

Even if we just have the modest goal of following the gradient of $L(w)$ to a local minimum, even computing the gradient is not so simple.  Clearly, even to evaluate $L(w)$ requires solving the “inner” minimization of $E$. It turns out that one can compute $\nabla_w L(w)$ through first solving the inner minimization, and then solving a linear system.

1. Input $w$
2. Solve $y^* \leftarrow \arg\min_y E(y,w)$.
3. Compute:
4.    (a) the loss $L(w)= Q(y^*)$
5.    (b) the gradient $g=\nabla_y Q(y^*)$
6.    (c) the Hessian $H=\frac{\partial^2 E(y^*,w)}{\partial w\partial w^T}$
7. Solve the linear system $z=H^{-1} g$.
8. Compute the parameter gradient $\nabla_w L(w) = - \frac{\partial^2 E(y^*,w)}{\partial w\partial y^T} z$
9. Return $L(w)$ and $\nabla_w L(w)$.

This looks a bit nasty, since we need to compute second-derivative matrices of $E$.  In fact, as long as one has a routine to compute $\nabla_y E$ and $\nabla_w E$, this can be avoided through Efficient Matrix-vector products.  This is essentially proposed by Do, Foo, and Ng in “Efficient multiple hyperparameter learning for log-linear models”.

Overall, this is a decent approach, but it can be quite slow, simply because one must solve an “inner” optimization in order to compute each gradient of the “outer” optimization.  Often, the inner-optimization needs to be solved to very high accuracy in order to estimate a gradient accurately enough to reduce $L(w)$— higher accuracy than is needed when one is simply using the predicted value $y^*$ itself.

## Truncated optimization

To get around this expense, a fairly obvious idea is to re-define the problem.  The slowness of exactly computing the gradient stems from needing to exactly solve the inner optimization.  Hence, perhaps we re-define the problem such that an inexact solve of the inner problem nevertheless yields an “exact” gradient?

Re-define the problem as solving

$\min_{w} L(w)=Q(y^*(w)), \text{ } y^*(w) := \text{opt-alg}_y E(y,w)$,

where $\text{opt-alg}$ denotes an approximate solve of the inner optimization.  In order for this to work, $\text{opt-alg}$ must be defined in such a way that $y^*(w)$ is a continuous function of $w$.  With standard optimization methods such as gradient descent or BFGS, this can be achieved by assuming there are a fixed number of iterations applied, with a fixed step-size.  Since each iteration of these algorithms is continuous, this clearly defines $y^*(w)$ as a continuous function.  Thus, in principle, it could be optimized efficiently through automatic differentiation of the code that optimizes $E$.  That’s fine in principle, but often inconvenient in practice.

It turns out, however, that one can derive “backpropagating” versions of algorithms like gradient descent, that take as input only a procedure to compute $E$ along with it’s first derivatives.  These algorithms can then produce the gradient of $L$ in the same time as automatic differentiation.

If the inner-optimization is gradient descent for $N$ steps with a step-size of $\lambda$, the algorithm to compute the loss is simple:

1. Input $w$
2. For $k=0,1,...,N-1$
3.     (a) $y_{k+1} \leftarrow y_k - \lambda \nabla_y E(y_k,w)$
4. Return $L(w) = Q(y_N)$

How to compute the gradient of this quantity?  The following algorithm does the trick.

1. $\overleftarrow{y_N} \leftarrow \nabla Q(y_N)$
2. $\overleftarrow{w} \leftarrow 0$
3. For $k=N-1,...,0$
4.     (a) $\overleftarrow{w} \leftarrow \overleftarrow{w} - \lambda \frac{\partial^2 E(y_k,w)}{\partial w \partial y^T} \overleftarrow{y_{k+1}}$
5.     (b) $\overleftarrow{y_k} \leftarrow \overleftarrow{y_{k+1}} - \lambda \frac{\partial^2 E(y_k,w)}{\partial y \partial y^T} \overleftarrow{y_{k+1}}$
6. Return $\nabla L = \overleftarrow{w}$.

Similar algorithms can be derived for the heavy-ball algorithm (with a little more complexity) and limited memory BFGS (with a lot more complexity).

## Code

So, finally, here is the code, and I’ll give a simple example of how to use it. There are just four simple files:

I think the meanings of this are pretty straightforward, so I’ll just quickly step through the demo here.  I’ll start off by grabbing taking one of Matlab’s built-in datasets (on cities) so that we are trying to predict a measure of crime from measures of climate, housing, health, transportation, arts, recreation, and economy, as well as a constant.  There are 329 data, total, which I split into a training set of size 40, a validation set of size 160, and a test set of size 129.

load cities ratings
X = ratings;
for i=1:size(X,2)
X(:,i) = X(:,i) - mean(X(:,i));
X(:,i) = X(:,i) / std( X(:,i));
end

% predict crime from climate, housing, health, trans, edu, arts, rec, econ,
Y = X(:,4);
X = [X(:,[1:3 5:9]) 1+0*X(:,1)];

p = randperm(length(Y));
X = X(p,:);
Y = Y(p);

whotrain = 1:50;
whoval   = 51:200;
whotest  = 201:329;

Xtrain = X(whotrain,:);
Xval   = X(whoval  ,:);
Xtest  = X(whotest ,:);
Ytrain = Y(whotrain);
Yval   = Y(whoval  );
Ytest  = Y(whotest );

Next, I’ll set up some simple constants that will be used later on, and define the optimization parameters for minFunc, that I will be using for the outer optimization. In particular, here I choose the inner optimization to use 20 iterations.

opt_iters = 20;
ndims = size(Xtrain,2); w0    = zeros(ndims,1); ndata = size(Xtrain,1); ndata_val = size(Xval,1);
options = []; options.Method  = 'gd'; options.LS      = 1; options.MaxIter = 100;

Now, I’ll define the training risk function ($E$ in the notation above). The computes the risk with a regularization constant of $a$, as well as derivatives. I’ll also define the validation risk ($Q$ in the notation above).

function [R dRdw dRdloga] = training_risk(w,loga)
a         = exp(loga);
R         = sum( (Xtrain*w - Ytrain).^2 )/ndata + a*sum(w.^2);
dRdw      = 2*Xtrain'*(Xtrain*w-Ytrain)  /ndata + 2*a*w;
dRda      = sum(w.^2);
dRdloga = dRda*a;
end

function [R g] = validation_risk(w)
R = sum( (Xval*w - Yval).^2 ) / ndata_val;
g = 2*Xval'*(Xval*w-Yval)     / ndata_val;
end

Now, before going any further, let’s do a traditional sweep through regularization constants to see what that looks like.

LAMBDA = -5:.25:2;
VAL_RISK = 0*LAMBDA;
for i=1:length(LAMBDA)
VAL_RISK(i) = back_lbfgs(@training_risk,@validation_risk,w0,LAMBDA(i),opt_iters);
end

Plotting, we get the following:

This is a reasonable looking curve. Instead, let’s ask the algorithm to find the constant by gradient descent.

eval = @(loga) back_lbfgs(@training_risk,@validation_risk,w0,loga,opt_iters);
loga = 0;
[loga fval] = minFunc(eval,loga,options);

Running the optimization, we see:

Iteration   FunEvals     Step Length    Function Val        Opt Cond
1          2     1.00000e+00     8.74176e-01     3.71997e-03
2          3     1.00000e+00     8.73910e-01     1.86453e-04
3          4     1.00000e+00     8.73909e-01     1.06619e-05
4          5     1.00000e+00     8.73909e-01     3.60499e-08

This leads to a regularizer of the form:

$0.860543 ||w||_2^2$.

We can plot this on the graph, and see it matches the result of cross-validation.

If we actually compute the test-set error, this is 0.708217.

OK, let’s be a little bit more adventurous, and use a third-order regularizer. This is done like so:

    function [R dRdw dRdloga] = training_risk2(w,loga)
a         = exp(loga(1));
b         = exp(loga(2));
R         = sum( (Xtrain*w - Ytrain).^2 ) / ndata + a*sum(abs(w).^2)     + b*sum(abs(w).^3);
dRdw      = 2*Xtrain'*(Xtrain*w-Ytrain)   / ndata + a*abs(w).^1.*sign(w) + b*abs(w).^2.*sign(w);
dRda      = sum(abs(w).^2);
dRdb      = sum(abs(w).^3);
dRdloga = [dRda*a; dRdb*b];
end

Running the optimization, we see:

Iteration   FunEvals     Step Length    Function Val        Opt Cond
1          2     1.00000e+00     8.74445e-01     1.17262e-02
2          3     1.00000e+00     8.73685e-01     3.21956e-03
3          4     1.00000e+00     8.73608e-01     1.41744e-03
4          5     1.00000e+00     8.73598e-01     8.20040e-04
5          6     1.00000e+00     8.73567e-01     1.39830e-03
6          7     1.00000e+00     8.73513e-01     2.52994e-03
7          8     1.00000e+00     8.73471e-01     1.77157e-03
...
23         28     1.00000e+00     8.70741e-01     8.42628e-06

With a final regularizer of the form

$0.000160 ||w|_2^2 + 5.117057 * ||w||_3^3$

and a hardly-improved test error of 0.679155.

Finally, let’s fit a fourth-order polynomial.

    function [R dRdw dRdloga] = training_risk3(w,loga)
a         = exp(loga);
R         = sum( (Xtrain*w - Ytrain).^2 ) / ndata;
dRdw      = 2*Xtrain'*(Xtrain*w-Ytrain)   / ndata;
for ii=1:length(a)
b=ii+1;
R    = R    + a(ii)*sum(abs(w).^b);
dRdw = dRdw + a(ii)*abs(w).^(b-1).*sign(w);
dRda(ii,1) = sum(abs(w).^b);
end
dRdloga = dRda.*a;
end

eval = @(loga) back_lbfgs(@training_risk3,@validation_risk,w0,loga,opt_iters);
loga = [0; 0; 0; 0];
loga = minFunc(eval,loga,options);

Running the optimization, we see

Iteration   FunEvals     Step Length    Function Val        Opt Cond
1          2     1.00000e+00     8.73982e-01     1.02060e-02
2          3     1.00000e+00     8.73524e-01     3.30445e-03
3          4     1.00000e+00     8.73447e-01     1.74779e-03
4          5     1.00000e+00     8.73435e-01     1.43655e-03
5          6     1.00000e+00     8.73345e-01     4.98340e-03
6          7     1.00000e+00     8.73295e-01     1.85535e-03
7          8     1.00000e+00     8.73231e-01     1.81136e-03
...
38         41     1.00000e+00     8.69758e-01     3.44848e-06

Yielding the (strange) regularizer

$0.000000 ||w||_2^2 + 0.000010 ||w||_3^3 + 24.310631 ||w||_4^4 + 0.325565 ||w||_5^5$

and a final test-error of 0.67187.

# Reducing Sigmoid computations by (at least) 88.0797077977882%

A classic implementation issue in machine learning is reducing the cost of computing the sigmoid function

$\sigma(a) = \frac{1}{1+\exp(-a)}$.

Specifically, it is common to profile your code and discover that 90% of the time is spent computing the $\exp$ in that function.  This comes up often in neural networks, as well as in various probabilistic architectures, such as sampling from Ising models or Boltzmann machines.  There are quite a few classic approximations to the function, using simple polynomials, etc. that can be used in neural networks.

Today, however, I was faced with a sampling problem involving the repeated use of the sigmoid function, and I noticed a simple trick that could reduce the number of sigmoids by about 88% without introducing any approximation.  The particular details of the situation aren’t interesting, but I repeatedly needed to do something like the following:

1. Input $a \in \Re$
2. Compute a random number $r \in [0,1]$
3. If $r < \sigma(a)$
4.   Output $+1$
5. Else
6.     Output $-1$

Now, let’s assume for simplicity that $a$ is positive.  (Otherwise, sample using $-a$ and then switch the sign of the output.)  There are two observations to make:

1. If $a$ is large, then you are likely to output $+1$
2. Otherwise, there are easy upper and lower bounds on the probability of outputting $+1$

This leads to the following algorithm:

1. Input $a \in \Re$
2. Compute a random number $r \in [0,1]$
3. If $a \geq 2$
4.     If $r \leq 0.880797077977882$ or $r \leq \sigma(a)$
5.         Output $+1$
6.     Else
7.         Output $-1$
8. Else
9.     If $r > .5 + a/4$
10.         Output $-1$
11.     Else if $r \leq .5 + a/5.252141128658$ or $r \leq \sigma(a)$
12.         Output $+1$
13.     Else
14.         Output $-1$

The idea is as follows:

1. If $a\geq 2$, then we can lower-bound the probability of outputting +1 by a pre-computed value of $\sigma(2)\approx0.8807...$, and short-circuit the computation in many cases.
2. If $a\leq 2$, then we can upper bound the sigmoid function by $.5+a/4$.
3. If $a\leq 2$, then we can also lower bound by $.5+a/5.252141...$ respectively.  (This constant was found numerically).

The three cases are illustrated in the following figure, where the input $a$ is on the x-axis, and the random number $r$ is on the y-axis.

Since, for all $a$ at least a fraction $\sigma(2)\approx.8807$ of the numbers will be short-circuited, sigmoid calls will be reduced appropriately.  If $a$ is often near zero, you will do even better.

Obviously, you can take this farther by adding more cases, which may or may not be helpful, depending on the architecture, and the cost of branching vs. the cost of computing an $\exp$.

# CRF Toolbox Updated

I updated the code for my Graphical Models / Conditional Random Fields toolbox This is a Matlab toolbox, though almost all the real work is done in compiled C++ for efficiency. The main improvements are:

• Lots of bugfixes.
• Various small improvements in speed.
• A unified CRF training interface to make things easier for those not training on images
• Binaries are now provided for Linux as well as OS X.
• The code for inference and learning using TRW is now multithreaded, using openmp.
• Switched to using a newer version of Eigen

There is also far more detailed examples, including a full tutorial of how to train a CRF to do “semantic segmentation” on the Stanford Backgrounds dataset. Just using simple color, position, and Histogram of Gradient features, the error rates are 23%, which appear to be state of the art (and better than previous CRF based approaches.) It takes about 90 minutes to train on my 8-core machine, and processes new frames in a little over a second each.

For fun, I also ran this model on a video of someone driving from Alexandria into Georgetown. You can see that the results are far from perfect but are reasonably good. (Notice it successfully distinguishes trees and grass at 0:12)

I’m keen to have others use the code (what with the hundreds of hours spent writing it), so please send email if you have any issues.

# Personal opinions about graphical models 1: The surrogate likelihood exists and you should use it.

When talking about graphical models with people  (particularly computer vision folks) I find myself advancing a few opinions over and over again.  So, in an effort to stop bothering people at conferences, I thought I’d write a few entries here.

The first thing I’d like to discuss is “surrogate likelihood” training.  (So far as I know, Martin Wainwright was the first person to give a name to this method.)

### Background

Suppose we want to fit a Markov random field (MRF).  I’m writing this as a generative model with an MRF for simplicity– pretty much the same story holds with a Conditional Random Field in the discriminative setting.

$p({\bf x}) = \frac{1}{Z} \prod_{c} \psi({\bf x}_c) \prod_i \psi(y_i)$

Here, the first product is over all cliques/factors in the graph, and the second is over all single variables.  Now, it is convenient to note that MRFs can be seen as members of the exponential family

$p({\bf x};{\boldsymbol \theta}) = \exp( {\boldsymbol \theta} \cdot {\bf f}({\bf x}) - A({\boldsymbol \theta}) )$,

where

${\bf f}({\bf X})=\{I[{\bf X}_{c}={\bf x}_{c}]|\forall c,{\bf x}_{c}\}\cup\{I[X_{i}=x_{i}]|\forall i,x_{i} \}$

is a function consisting of indicator functions for each possible configuration of each clique and variable, and the log-partition function

$A(\boldsymbol{\theta})=\log\sum_{{\bf x}}\exp\boldsymbol{\theta}\cdot{\bf f}({\bf x})$.

ensures normalization.

Now, the log-partition function has the very important (and easy to show) property that the gradient is the expected value of $\bf f$.

$\displaystyle \frac{dA}{d{\boldsymbol \theta}} = \sum_{\bf x} p({\bf x};{\boldsymbol \theta}) {\bf f}({\bf x})$

With a graphical model, what does this mean?  Well, notice that the expected value of, say, $I[X_i=x_i]$ will be exactly $p(x_i;{\boldsymbol \theta})$. Thus, the expected value of ${\bf f}$ will be a vector containing all univariate and clique-wise marginals.  If we write this as ${\boldsymbol \mu}({\boldsymbol \theta})$, then we have

$\displaystyle \frac{dA}{d{\boldsymbol \theta}} = {\boldsymbol \mu}({\boldsymbol \theta})$.

### The usual story

Suppose we want to do maximum likelihood learning.  This means we want to set ${\boldsymbol \theta}$ to maximize

$L( {\boldsymbol \theta} ) = \frac{1}{N}\sum_{\hat{{\bf x}}}\log p({\bf x};{\boldsymbol \theta})={\boldsymbol \theta}\cdot\frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}})-A({\boldsymbol \theta}).$

If we want to use gradient ascent, we would just take a small step along the gradient.  This has a very intuitive form: it is the difference of the expected value of $\bf f$ under the model to the expected value of $\bf f$ under the current distribution.

$\displaystyle \frac{dL}{d{\boldsymbol \theta}} = \frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}}) - \sum_{\bf x} p({\bf x};{\boldsymbol \theta}) {\bf f}({\bf x})$.

$\displaystyle \frac{dL}{d{\boldsymbol \theta}} = \frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}}) - {\boldsymbol \mu}({\boldsymbol \theta})$.

Note the lovely property of moment matching here. If we have found a solution, then $dL/d{\boldsymbol \theta}=0$ and so the expected value of $\bf f$ under the current distribution will be exactly equal to that under the data.

Unfortunately, in a high-treewidth setting, we can’t compute the marginals.  That’s too bad.  However, we have all these lovely approximate inference algorithms (loopy belief propagation, tree-reweighted belief propagation, mean field, etc.).  Suppose we write the resulting approximate marginals as $\tilde{{\boldsymbol \mu}}({\boldsymbol \theta})$.  Then, instead of taking the above gradient step, why not instead just use

$\frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}}) - \tilde{{\boldsymbol \mu}}({\boldsymbol \theta})$?

That’s all fine!  However, I often see people say/imply/write some or all of the following:

1. This is not guaranteed to converge.
2. There is no longer any well-defined objective function being maximized.
3. We can’t use line searches.
4. We have to use (possibly stochastic) gradient ascent.
5. This whole procedure is frightening and shouldn’t be mentioned in polite company.

I agree that we should view this procedure with some suspicion, but it gets far more than it deserves! The first four points, in my view, are simply wrong.

### What’s missing

The critical thing that is missing from the above story is this: Approximate marginals come together with an approximate partition function!

That is, if you are computing approximate marginals $\tilde{{\boldsymbol \mu}}({\boldsymbol \theta})$ using loopy belief propagation, mean-field, or tree-reweighted belief propagation, there is a well-defined approximate log-partition function $\tilde{A}({\boldsymbol \theta})$ such that

$\displaystyle \tilde{{\boldsymbol \mu}}({\boldsymbol \theta}) = \frac{d\tilde{A}}{d{\boldsymbol \theta}}$.

What this means is that you should think, not of approximating the likelihood gradient, but of approximating the likelihood itself. Specifically, what the above is really doing is optimizing the “surrogate likelihood”

$\tilde{L}({\boldsymbol \theta}) = {\boldsymbol \theta}\cdot\frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}})-\tilde{A}({\boldsymbol \theta}).$

What’s the gradient of this? It is

$\frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}}) - \tilde{{\boldsymbol \mu}}({\boldsymbol \theta}),$

or exactly the gradient that was being used above. The advantage of doing things this way is that it is a normal optimization.  There is a well-defined objective. It can be plugged into a standard optimization routine, such as BFGS, which will probably be faster than gradient ascent.  Line searches guarantee convergence. $\tilde{A}$ is perfectly tractable to compute. In fact, if you have already computed approximate marginals, $\tilde{A}$ has almost no cost. Life is good.

The only counterargument I can think of is that mean-field and loopy BP can have different local optima, which might mean that a no-line-search-refuse-to-look-at-the-objective-function-just-follow-the-gradient-and-pray style optimization could be more robust, though I’d like to see that argument made…

I’m not sure of the history, but I think part of the reason this procedure has such a bad reputation (even from people that use it!) might be that it predates the “modern” understanding of inference procedures as producing approximate partition functions as well as approximate marginals.