# A simple explanation of reverse-mode automatic differentiation

My previous rant about automatic differentiation generated several requests for an explanation of how it works. This can be confusing because there are different types of automatic differentiation (forward-mode, reverse-mode, and hybrids). This is my attempt to explain the basic idea of reverse-mode autodiff as simply as possible.

Reverse-mode automatic differentiation is most attractive when you have a function that takes $n$ inputs $x_1,x_2,...,x_n$, and produces a single output $x_N$. We want the derivatives of that function, $\displaystyle{\frac{d x_N}{d x_i}}$, for all $i$.

Point #1: Any differentiable algorithm can be translated into a sequence of assignments of basic operations.

Forward-Prop

for $i=n+1,n+2,...,N$

$x_i \leftarrow f_i({\bf x}_{\pi(i)})$

Here, each function $f_i$ is some very basic operation (e.g. addition, multiplication, a logarithm) and $\pi(i)$ denotes the set of “parents” of $x_i$. So, for example, if $\pi(7)=(2,5)$ and $f_7 = \text{add}$, then $x_7 = x_2 + x_5$.

It would be extremely tedious, of course, to actually write an algorithm in this “expression graph” form. So, autodiff tools create this representation automatically from high-level source code.
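As a concrete illustration, here is a minimal Python sketch of such an expression graph and its forward pass, for the hypothetical function $f(x_1, x_2) = \log(x_1 x_2) + x_2$ (the graph layout and names are my own, not from any particular autodiff tool):

```python
import math

# A hypothetical expression graph for f(x1, x2) = log(x1 * x2) + x2.
# Each node stores its basic operation f_i and the tuple pi(i) of parent indices.
# Nodes 0 and 1 are the inputs; the last node is the output x_N.
graph = [
    (None, ()),                      # x1 (input)
    (None, ()),                      # x2 (input)
    (lambda a, b: a * b, (0, 1)),    # x3 = x1 * x2
    (math.log, (2,)),                # x4 = log(x3)
    (lambda a, b: a + b, (3, 1)),    # x5 = x4 + x2
]

def forward(graph, inputs):
    """Forward-prop: x_i <- f_i(x_pi(i)) for i = n+1, ..., N."""
    x = list(inputs) + [None] * (len(graph) - len(inputs))
    for i, (f, parents) in enumerate(graph):
        if f is not None:
            x[i] = f(*(x[p] for p in parents))
    return x

x = forward(graph, [2.0, 3.0])
print(x[-1])  # log(2 * 3) + 3
```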

Point #2: Given an algorithm in the previous format, it is easy to compute its derivatives.

The essential point here is just the application of the chain rule.

$\displaystyle{ \frac{d x_N}{d x_i} = \sum_{j:i\in \pi(j)} \frac{d x_N}{d x_j}\frac{\partial x_j}{\partial x_i}}$

Applying this, we can compute all the derivatives in reverse order.

Back-Prop

$\displaystyle{ \frac{d x_N}{d x_N} \leftarrow 1}$

for $i=N-1,N-2,...,1$

$\displaystyle{ \frac{d x_N}{d x_i} \leftarrow \sum_{j:i\in \pi(j)} \frac{d x_N}{d x_j}\frac{\partial f_j}{\partial x_i}}$

That’s it!  Just create an expression graph representation of the algorithm and differentiate each basic operation $f_i$ in reverse order using calc 101 rules.
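Continuing the toy Python sketch for the same hypothetical function $f(x_1, x_2) = \log(x_1 x_2) + x_2$: if each node also stores the partial derivatives of its basic operation, the back-prop loop above is only a few lines (again a sketch, not any particular tool's implementation):

```python
import math

# Same hypothetical graph as before: f(x1, x2) = log(x1 * x2) + x2.
# Each node is (f, parents, partials), where partials[k] computes
# df_i / d(parent_k) from the parent values.
graph = [
    (None, (), None),                                              # x1 (input)
    (None, (), None),                                              # x2 (input)
    (lambda a, b: a * b, (0, 1), [lambda a, b: b, lambda a, b: a]),  # x3 = x1*x2
    (math.log, (2,), [lambda a: 1.0 / a]),                         # x4 = log(x3)
    (lambda a, b: a + b, (3, 1), [lambda a, b: 1.0, lambda a, b: 1.0]),  # x5 = x4+x2
]

def forward(graph, inputs):
    x = list(inputs) + [None] * (len(graph) - len(inputs))
    for i, (f, parents, _) in enumerate(graph):
        if f is not None:
            x[i] = f(*(x[p] for p in parents))
    return x

def backward(graph, x):
    """Back-prop: accumulate dx_N/dx_i by the chain rule, in reverse order."""
    grad = [0.0] * len(graph)
    grad[-1] = 1.0                          # dx_N/dx_N = 1
    for j in range(len(graph) - 1, -1, -1):
        f, parents, partials = graph[j]
        if f is None:
            continue
        vals = [x[p] for p in parents]
        for k, p in enumerate(parents):
            grad[p] += grad[j] * partials[k](*vals)   # (dx_N/dx_j)(df_j/dx_p)
    return grad

x = forward(graph, [2.0, 3.0])
g = backward(graph, x)
print(g[0], g[1])   # df/dx1 = 1/x1, df/dx2 = 1/x2 + 1
```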

Other stuff:

• No, this is not the same thing as symbolic differentiation.  This should be obvious:  Most algorithms don’t even have simple symbolic representations.  And, even if yours does,  it is possible that it “explodes” upon symbolic differentiation.  As a contrived example, try computing the derivative of $\exp(\exp(\exp(\exp(\exp(\exp(x))))))$.
• The complexity of the back-prop step is the same as the forward propagation step.
• In machine learning, functions from N inputs to one output come up all the time:  The N inputs are parameters defining a model, and the 1 output is a loss, measuring how well the model fits training data.  The gradient can be fed into an optimization routine to fit the model to data.
• There are two common ways of implementing this:
1. Operator Overloading. One can create a new variable type that has all the common operations of numeric types, and that automatically builds an expression graph when the program is run. One can then call the back-prop routine on this expression graph. Hence, one does not need to modify the program, just replace each numeric type with this new type. This is fairly easy to implement, and very easy to use. David Gay's RAD toolbox for C++ is a good example, which I use all the time.
The major downside of operator overloading is efficiency: current compilers will not optimize the backprop code. Essentially, this step is interpreted. Thus, one finds in practice a non-negligible overhead of, say, 2-15 times the cost of the original algorithm using a native numeric type. (The overhead depends on how much the original code benefits from compiler optimizations.)
2. Source code transformation. Alternatively, one could write a program that examines the source code of the original program, and transforms this into source code computing the derivatives.  This is much harder to implement, unless one is using a language like Lisp with very uniform syntax.  However, because the backprop source code produced is then optimized like normal code, it offers the potential of zero overhead compared with manually computed derivatives.
• If it isn’t convenient to use automatic differentiation, one can also use “manual automatic differentiation”. That is, to compute the derivatives, just attack each intermediate value your algorithm computes, in reverse order.
• Some of the most interesting work on autodiff comes from Pearlmutter and Siskind, who have produced a system called Stalingrad for a subset of Scheme that allows for crazy things like taking derivatives of code that is itself taking derivatives. (So you can, for example, produce Hessians.) I think they wouldn’t mind hearing from potential users.

# Hessian-Vector products

You have some function $f({\bf x})$.  You have figured out how to compute its gradient, $g({\bf x})=\frac{\partial}{\partial \bf x}f({\bf x})$.  Now, however, you find that you are implementing some algorithm (like, say, Stochastic Meta Descent), and you need to compute the product of the Hessian $H({\bf x})=\frac{\partial^2}{\partial {\bf x}\partial{\bf x}^T}f({\bf x})$ with certain vectors.  You become very upset, because either A) you don’t feel like deriving the Hessian (probable), or B) the Hessian has $N^2$ elements, where $N$ is the length of $\bf x$, and that is too big to deal with (more probable).  What to do?  Behold:

${\bf g}({\bf x}+{\bf \Delta x}) \approx {\bf g}({\bf x}) + H({\bf x}){\bf \Delta x}$

Consider the Hessian-Vector product we want to compute, $H({\bf x}){\bf v}$.  For small $r$,

${\bf g}({\bf x}+r{\bf v}) \approx {\bf g}({\bf x}) + r H({\bf x}){\bf v}$

And so,

$\boxed{H({\bf x}){\bf v}\approx\frac{{\bf g}({\bf x}+r{\bf v}) - {\bf g}({\bf x})}{r}}$

This trick has apparently been around forever.  The approximation becomes exact in the limit $r \rightarrow 0$.  Of course, for small $r$, numerical problems will also start to kill you.  Pearlmutter’s algorithm is a way to compute $H {\bf v}$ with the same complexity, without suffering rounding errors.  Unfortunately, Pearlmutter’s algorithm is kind of complex, while the above is absolutely trivial.

Update: Pearlmutter himself comments below that if we want to use the finite difference trick, we would do better to use the approximation:

$\boxed{H({\bf x}){\bf v}\approx\frac{{\bf g}({\bf x}+r{\bf v}) - {\bf g}({\bf x}-r{\bf v})}{2r}}.$

This expression will be closer to the true value for larger $r$, meaning that we are less likely to get hurt by rounding error. This is nicely illustrated on page 6 of these notes.
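Both finite-difference approximations are easy to check in a few lines of Python. Here I use the hypothetical quadratic $f({\bf x}) = \frac{1}{2}{\bf x}^T A {\bf x}$, chosen only because its Hessian-vector product $A{\bf v}$ is known exactly:

```python
import numpy as np

# Test function: f(x) = 0.5 * x^T A x, so g(x) = A x and H v = A v exactly.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def grad(x):
    return A @ x

def hvp_forward(grad, x, v, r=1e-6):
    """One-sided: H v ~ (g(x + r v) - g(x)) / r."""
    return (grad(x + r * v) - grad(x)) / r

def hvp_central(grad, x, v, r=1e-6):
    """Central: H v ~ (g(x + r v) - g(x - r v)) / (2 r), accurate to O(r^2)."""
    return (grad(x + r * v) - grad(x - r * v)) / (2 * r)

x = np.array([1.0, -2.0])
v = np.array([0.5, 1.0])
print(hvp_central(grad, x, v))   # close to A @ v
```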

# Why does regularization work?

When fitting statistical models, we usually need to “regularize” the model. The simplest example is probably linear regression. Take some training data, $\{({\hat {\bf x}}, {\hat y})\}$. Given a vector of weights $\bf w$, the total squared distance is

$\sum_{ ({\hat {\bf x}}, {\hat y})} ({\bf w}^T{\hat {\bf x}} - {\hat y})^2$

So to fit the model, we might find ${\bf w}$ to minimize the above loss. Commonly (particularly when ${\bf x}$ has many dimensions), we find that the above procedure leads to overfitting: very large weights ${\bf w}$ that fit the training data very well, but poorly predict future data. “Regularization” means modifying the optimization problem to prefer small weights. Consider

$\sum_{ ({\hat {\bf x}}, {\hat y})} ({\bf w}^T{\hat {\bf x}} - {\hat y})^2 + \lambda ||{\bf w}||^2.$

Here, $\lambda$ trades off how much we care about the model fitting well versus how much we care about ${\bf w}$ being small. Of course, this is a much more general concept than linear regression, and one could pick many alternatives to the squared norm to measure how simple a model is. Let’s abstract a bit, and just write some function $M$ to represent how well the model fits the data, and some function $R$ to represent how simple it is.

$M({\bf w}) + \lambda R({\bf w})$

The obvious question is: how do we pick $\lambda$? This, of course, has been the subject of much theoretical study. What is commonly done in practice is this:

1. Divide the data into a training set and test set.
2. Fit the model for a range of different $\lambda$s using only the training set.
3. For each of the models fit in step 2, check how well the resulting weights fit the test data.
4. Output the weights that perform best on test data.

(One can also retrain on all the data using the $\lambda$ that did best in step 3.)
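The four steps above can be sketched in Python, using ridge regression (squared loss plus $\lambda ||{\bf w}||^2$, which has a closed-form solution) on synthetic data; the data and the grid of $\lambda$ values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (an illustrative stand-in for real data).
n, d = 60, 10
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.5 * rng.normal(size=n)

# Step 1: divide the data into a training set and a test set.
X_tr, y_tr = X[:40], y[:40]
X_te, y_te = X[40:], y[40:]

def fit_ridge(X, y, lam):
    """Minimize sum (w^T x - y)^2 + lam * ||w||^2 (closed form)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Steps 2-4: fit for a range of lambdas on the training set,
# check each fit on the test set, and output the best weights.
lambdas = [10.0 ** k for k in range(-4, 4)]
best_lam, best_w, best_err = None, None, np.inf
for lam in lambdas:
    w = fit_ridge(X_tr, y_tr, lam)
    err = np.sum((X_te @ w - y_te) ** 2)
    if err < best_err:
        best_lam, best_w, best_err = lam, w, err

print(best_lam, best_err)
```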

Now, there are many ways to measure simplicity. (In the example above, we might consider $||{\bf w}||^2$, $||{\bf w}||$, $||{\bf w}||_1$, $||{\bf w}||_0$, …). Which one to use? If you drank too much coffee one afternoon, you might decide to include a bunch of regularizers simultaneously, each with a corresponding regularization parameter, ending up with a problem like

$M({\bf w}) + \sum_i \lambda_i R_i({\bf w})$.

And yet, staring at that equation, things take a sinister hue: if we include too many regularization parameters, won’t we eventually overfit the regularization parameters? And furthermore, won’t it be very hard to test all possible combinations of $\lambda_i$ to some reasonable resolution? What should we do?

And now I will finally get to the point.

Recently, I’ve been looking at regularization in a different way, which seems very obvious in retrospect. The idea is that when searching for the optimal regularization parameters, we are fitting a model– a model that just happens to be defined in an odd way.

First, let me define some notation. Let $T$ denote the training set, and let $S$ denote the test set. Now, define

${\bf f}(\lambda) = \arg\min_{\bf w} M_T({\bf w}) + \lambda R({\bf w})$

where $M_T$ is the function measuring how well the model fits the training data. So, for a given regularization parameter, ${\bf f}$ returns the weights solving the optimization problem. Given that, the optimal $\lambda$ will be the one minimizing the test loss:

$\lambda^* = \arg\min_\lambda M_S( {\bf f}(\lambda) )$

This extends naturally to the setting where we have a vector of regularization parameters ${\boldsymbol \lambda}$.

${\bf f}({\boldsymbol \lambda}) = \arg\min_{\bf w} M_T({\bf w}) + \sum_i \lambda_i R_i({\bf w})$

${\boldsymbol \lambda}^* = \arg\min_{\boldsymbol \lambda} M_S( {\bf f}({\boldsymbol \lambda}) )$

From this perspective, doing an optimization over regularization parameters makes sense: it is just fitting the parameters $\boldsymbol \lambda$ to the data $S$— it just so happens that the objective function is implicitly defined by a minimization over the data $T$.

Regularization works because it is fitting a single parameter to some data. (In general, one is safe fitting a single parameter to any reasonably sized dataset, although there are exceptions.) Fitting multiple regularization parameters should work, supposing the test data is big enough to support them. (There are computational issues I’m not discussing here with doing this, stemming from the fact that ${\bf f}({\boldsymbol \lambda})$ isn’t known in closed form, but is implicitly defined by a minimization. So far as I know, this is an open problem, although I suspect the solution is “implicit differentiation”.)
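For what it's worth, here is a sketch of how implicit differentiation would apply, assuming the inner minimizer ${\bf f}({\boldsymbol \lambda})$ is unique and smooth. Differentiating the first-order optimality condition of the inner problem gives the Jacobian of $\bf f$, and the chain rule then gives the gradient of the test loss:

```latex
% f(lambda) satisfies the first-order condition of the inner problem:
\nabla_{\bf w}\Big[M_T({\bf w}) + \sum_i \lambda_i R_i({\bf w})\Big]_{{\bf w}={\bf f}({\boldsymbol\lambda})} = 0

% Differentiating implicitly w.r.t. lambda_i, with H the Hessian of the
% inner objective at f(lambda):
H\,\frac{\partial {\bf f}}{\partial \lambda_i} + \nabla R_i({\bf f}) = 0
\quad\Longrightarrow\quad
\frac{\partial {\bf f}}{\partial \lambda_i} = -H^{-1}\,\nabla R_i({\bf f})

% Chain rule for the outer (test) objective:
\frac{\partial}{\partial \lambda_i} M_S({\bf f}({\boldsymbol\lambda}))
  = -\nabla M_S({\bf f})^T H^{-1}\,\nabla R_i({\bf f})
```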

Even regularizing the regularization parameters isn’t necessarily a crazy idea, though clearly $||{\boldsymbol \lambda}||^2$ isn’t the way to do it. (Simpler models correspond to larger regularization constants. Maybe $1/||{\boldsymbol \lambda}||^2$?) Then, you can regularize those parameters!

Well. In retrospect that wasn’t quite so simple as it seemed.

It’s a strong bet that this is a standard interpretation in some communities.

Update: The idea of fitting multiple parameters using implicit differentiation is published in the 1996 paper Adaptive regularization in neural network modeling by Jan Larsen, Claus Svarer, Lars Nonboe Andersen and Lars Kai Hansen.

# Linear Classifiers and Loss Functions

A linear classifier is probably the simplest machine learning technique. Given some input vector $\bf x$, predict some output $y$. One trains a vector of “weights” $\bf w$ that determines the behavior of the classifier. Given some new input $\bf x$, the prediction $f$ will be:

$f({\bf x}) = I[{\bf w}^T {\bf x}>0]$

Here, $I[\text{expr}]$ is an indicator function, taken here to be $1$ if $\text{expr}$ is true, and $-1$ otherwise. So, $y$ is one of two classes.

Often the classifier will be written like $I[{\bf w}^T {\bf x}+b>0]$, i.e. including a “bias” term $b$. Here I’ll ignore this– if you want a bias term, just append a $1$ to every vector $\bf x$.
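In code, the classifier (with the bias folded in as described) is a one-liner; this Python sketch uses made-up weights purely for illustration:

```python
import numpy as np

def predict(w, x):
    """Linear classifier: +1 if w^T x > 0, else -1."""
    return 1 if w @ x > 0 else -1

# Bias handled by appending a constant 1 to every input vector.
w = np.array([2.0, -1.0, 0.5])            # last entry plays the role of b
x = np.append(np.array([1.0, 3.0]), 1.0)  # input [1, 3] with 1 appended
print(predict(w, x))  # 2*1 - 1*3 + 0.5 = -0.5, so the prediction is -1
```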

Now, it seems like nothing could be simpler than a linear classifier, but there is considerable subtlety in how to train a linear classifier. Given some training data $\{({\hat {\bf x}},{\hat y})\}$, how do we find the vector of weights $\bf w$?

There are various ways of doing this. Three popular techniques are:

• The Perceptron Algorithm
• Logistic Regression
• Support Vector Machines (Hinge loss)

The perceptron algorithm works, roughly speaking, as follows: Go through the training data, element by element. If one datum is misclassified, slightly nudge the weights so that it is closer to being correct. This algorithm has the remarkable property that, if the training data is linearly separable, it will find a $\bf w$ that correctly classifies it in a finite number of iterations. (The number of iterations it takes depends on how separable the data is, with fewer mistakes made on data with a larger margin.)

Unfortunately, the assumption that the training data is linearly separable is basically never true in practice. There is still a bound in terms of how much the points would need to be moved in order to be linearly separable, but it isn’t as strong as we might like. (I don’t know what the folklore is for how this tends to work in practice.)
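A minimal Python sketch of the perceptron update, run on a small linearly separable toy dataset (the data and stopping rule are my own illustrative choices):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron: for each misclassified (x, y), nudge the weights toward
    correctness with the update w <- w + y * x. Stops early once an entire
    epoch passes with no mistakes."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:    # misclassified (or on the boundary)
                w += y_i * x_i
                mistakes += 1
        if mistakes == 0:
            break
    return w

# Toy separable data: the label is the sign of the first coordinate.
X = np.array([[2.0, 1.0], [1.0, -1.0], [-1.5, 0.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.sign(X @ w))   # matches y once the algorithm has converged
```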

Logistic regression and support vector machines take a different approach. There, a “loss function” is defined in terms of the weights, and one just optimizes the loss to find the best weights. (I’m ignoring regularization here for simplicity.) For logistic regression, the loss is

$L = -\sum_{({\hat {\bf x}},{\hat y})} \log p({\hat y}|{\hat {\bf x}})$

where $p(y=1|{\bf x})=\sigma({\bf w}^T {\bf x})$, for a sigmoid function $\sigma$, and $p(y=-1|{\bf x})=1-p(y=1|{\bf x})$.

For a support vector machine, the loss is

$L = \sum_{({\hat {\bf x}},{\hat y})} (1-{\hat y} \cdot {\bf w}^T {\hat {\bf x}})_+$

where $(a)_+$ is $a$ if $a>0$ and 0 otherwise. This is a hinge loss. Notice that it will be zero if ${\hat y} \cdot {\bf w}^T {\hat {\bf x}} > 1$, or if that particular training element is comfortably on the correct side of the decision boundary. Otherwise, the “pain” is proportional to “how wrong” the classification is.
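Both losses are a few lines of Python. Note that for $y \in \{-1,+1\}$, $p({\hat y}|{\hat {\bf x}}) = \sigma({\hat y}\,{\bf w}^T{\hat {\bf x}})$, since $1-\sigma(z)=\sigma(-z)$; the weights and data below are purely illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(w, X, y):
    """L = -sum log p(y|x), with p(y=1|x) = sigmoid(w^T x).
    For y in {-1, +1} this is p(y|x) = sigmoid(y * w^T x)."""
    return -np.sum(np.log(sigmoid(y * (X @ w))))

def hinge_loss(w, X, y):
    """L = sum (1 - y * w^T x)_+ : zero once y * w^T x > 1."""
    return np.sum(np.maximum(0.0, 1.0 - y * (X @ w)))

w = np.array([1.0, -1.0])
X = np.array([[2.0, 0.0], [0.0, 1.0]])
y = np.array([1, -1])
print(logistic_loss(w, X, y), hinge_loss(w, X, y))
```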

Both the hinge loss and logistic regression are convex loss functions. This means that if we apply a nonlinear search procedure to find a local minimum, it will also be the global minimum.

These different loss functions are compared here.

WHY NOT DO THE OBVIOUS THING?

Now, the critical point about these three above methods is that none of them do the seemingly obvious thing: find the vector of weights ${\bf w}$ that has the lowest classification error on the training data. Why not? Some would defend logistic regression or SVM for theoretical reasons (namely a meaningful probabilistic interpretation, and the reasonableness and theoretical guarantees of max-margin methods, respectively).

However, probably the more significant hurdle is computational considerations. Namely, the problem of finding the weights with lowest classification error is (in order of increasing horribleness) non-convex, non-differentiable, and NP-hard.

In fact it is NP-hard even to find an approximate solution, with worst-case guarantees. (See citation 2 below, which, interestingly, gives an approximate algorithm in terms of a property of the data.)

Nevertheless, if classification error is what we want, I don’t see how it makes sense to minimize some alternative loss function. As such, I decided today to try the following algorithm.

1. Apply an SVM to get an initial solution.
2. Apply heuristic search to minimize classification error, initialized to the solution of step 1.

Now, I am well aware of the problems with non-convex optimization. However, simply the fact that logistic regression or hinge loss is convex is not an argument in their favor. If theoretical considerations dictate that we minimize classification error, then substituting a different loss, and refusing to look at the classification rate, is highly questionable. Sure, that can lead to a convex optimization problem, but that’s because a different problem is being solved! The goal is accurate prediction of future data, not accurate minimization of a loss on the training data. If we use our best convex approximation to initialize the heuristic optimization of the loss we really want, we will never do worse.

EXPERIMENT

There are some heuristics available for doing this (Reference 3), but they seem a little expensive. I decided to try something really simple.

1. Fit a classifier (by hinge loss or logistic regression).
2. Throw out a few of the most badly misclassified points.
3. Repeat until all remaining points are correctly classified.

The idea is that, by “giving up” on badly misclassified points, the boundary might be movable into a position where it correctly classifies other points. There is no guarantee in general that this will find a linear classifier with a lower misclassification rate, but it should not do worse.
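A sketch of this loop in Python, using gradient descent on the logistic loss as a stand-in for whatever fitting routine is on hand; the toy dataset has one deliberately mislabeled point that the heuristic should learn to give up on:

```python
import numpy as np

def fit_logistic(X, y, steps=500, lr=0.1):
    """Plain gradient descent on the logistic loss; a stand-in for
    whatever convex fit (SVM, logistic regression) is available."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-y * (X @ w)))   # prob. of the correct label
        w += lr * X.T @ (y * (1.0 - p)) / len(y)
    return w

def prune_and_refit(X, y, k=1, max_rounds=50):
    """Fit, throw out the k most badly misclassified points, and repeat
    until all remaining points are correctly classified."""
    X, y = X.copy(), y.copy()
    for _ in range(max_rounds):
        w = fit_logistic(X, y)
        margins = y * (X @ w)                    # > 0 means correct
        wrong = np.where(margins <= 0)[0]
        if len(wrong) == 0:
            break
        drop = wrong[np.argsort(margins[wrong])[:k]]   # worst offenders
        keep = np.setdiff1d(np.arange(len(y)), drop)
        X, y = X[keep], y[keep]
    return w

# Toy data: labels follow the sign of the first coordinate,
# except for one deliberately mislabeled point at the end.
X = np.array([[1.0, 0.0], [2.0, 0.0], [1.0, 1.0], [2.0, 1.0],
              [-1.0, 0.0], [-2.0, 0.0], [-1.0, 1.0], [-2.0, 1.0],
              [5.0, 0.0]])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1, -1])
w = prune_and_refit(X, y)
print(np.sum(np.sign(X @ w) == y))  # 8 of 9: only the outlier stays wrong
```

On this toy data, the loop ends up dropping the mislabeled point, and the refit classifies all the remaining points correctly.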

To test this out, I got some data from the MNIST handwritten digit database. To make the problem harder, I took one class to be random images from either the set of 1’s or 2’s, while the other class was 3’s or 4’s. The images were downsampled to 7×7, and a constant of one added. So we are looking for a linear classifier in a 50 dimensional space.

The classifier was trained on a set of 10,000 examples, and evaluated on a test set of 10,000 examples.

Here are the results where we throw out the worst 100 training points in step 2:

Here are the results when we throw out 20 at a time:

Here are the results when we throw out 5 at a time:

There seems to be a trade-off in step 2: Fewer models need to be fit if we throw out more points at a time. However, this seems to come with a small penalty in terms of the classification error of the final model.

So, in summary: a drop in classification error on test data from .0941 to .078. That’s a 17% drop. (Or a 21% drop, depending upon which rate you use as a base.) This comes from a method that you can implement with basically zero extra work if you already have a linear classifier. Seems worth a try.

References:

[2] Efficient Learning of Linear Perceptrons by Shai Ben-David and Hans Ulrich Simon

[3] Optimizing 0/1 Loss for Perceptrons by Random Coordinate Descent by L. Li and H.-T. Lin