Update: (November 2015) In the almost seven years since writing this, there has been an explosion of great tools for automatic differentiation and a corresponding upsurge in its use. Thus, happily, this post is more or less obsolete.
I recently got back reviews of a paper in which I used automatic differentiation. Therein, a reviewer clearly thought I was using finite differences, or “numerical” differentiation. This has led me to wonder: Why don’t machine learning people use automatic differentiation more? Why don’t they use it…constantly? Before recklessly speculating on the answer, let me briefly review what automatic differentiation (henceforth “autodiff”) is. Specifically, I will be talking about reverse-mode autodiff.
(Here, I will use “subroutine” to mean a function in a computer programming language, and “function” to mean a mathematical function.)
It works like this:
- You write a subroutine to compute a function $f(x)$ (e.g. in C++ or Fortran). You know $f$ to be differentiable, but don’t feel like writing a subroutine to compute $\nabla f$.
- You point some autodiff software at your subroutine. It produces a subroutine to compute the gradient.
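- That new subroutine has the same order of time complexity as the original subroutine!
- That cost does not depend on the dimensionality of $x$.
- It also does not suffer from the truncation and cancellation errors of finite differences! The derivatives are accurate to working precision.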
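For contrast, here is a minimal sketch of the finite-difference approach the reviewer had in mind. The names `forward_difference_gradient`, `f`, and `errors` are my own for illustration, and the test function is just sin(x). The point is that the cost grows with the number of inputs and the accuracy hinges on the step size `h`.

```python
import math

def forward_difference_gradient(f, x, h):
    """Approximate the gradient of f at x (a list of floats) by forward differences.

    Costs len(x) + 1 evaluations of f, and the answer depends on the step h.
    """
    base = f(x)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += h          # perturb one coordinate at a time
        grad.append((f(shifted) - base) / h)
    return grad

def f(x):
    # Illustrative one-dimensional test function.
    return math.sin(x[0])

exact = math.cos(1.0)  # true derivative of sin at 1.0

# Absolute error of the approximation for a large, a moderate, and a tiny step.
errors = {
    h: abs(forward_difference_gradient(f, [1.0], h)[0] - exact)
    for h in (1e-1, 1e-8, 1e-15)
}
print(errors)
```

A large step gives a poor approximation because of truncation, and a very small step gives a poor one because of cancellation. Autodiff has no step size to tune.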
To take a specific example, suppose that you have written a subroutine to evaluate a neural network. If you perform autodiff on that subroutine, it will produce code that is equivalent to the backpropagation algorithm. (Incidentally, autodiff significantly predates the invention of backprop.)
Why should autodiff be possible? It is embarrassingly obvious in retrospect: Complex subroutines consist of many elementary operations. Each of those is differentiable. Apply the calc 101 chain rule to the expression graph of all these operations.
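To make that concrete, here is a toy sketch of reverse-mode autodiff in plain Python. It is not how any particular tool is implemented. The names `Var`, `sin`, and `backward` are hypothetical, and only addition, multiplication, and sine are supported. Each operation records its inputs and local partial derivatives, and one backward sweep over the graph applies the chain rule.

```python
import math

class Var:
    """A node in the expression graph: a value plus links to its inputs."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (input Var, local partial derivative)
        self.grad = 0.0         # d(output)/d(this node), filled in by backward()

    def __add__(self, other):
        other = _wrap(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))
    __radd__ = __add__

    def __mul__(self, other):
        other = _wrap(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))
    __rmul__ = __mul__

def _wrap(x):
    """Promote plain numbers to constant graph nodes."""
    return x if isinstance(x, Var) else Var(x)

def sin(v):
    return Var(math.sin(v.value), ((v, math.cos(v.value)),))

def backward(output):
    """Propagate derivatives from the output back to every input (chain rule)."""
    order, seen = [], set()

    def visit(node):
        # Depth-first post-order: inputs are listed before the nodes using them.
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    # Reverse order: a node is processed only after all of its consumers.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad

# Usage: f(x1, x2) = x1 * x2 + sin(x1)
x1, x2 = Var(2.0), Var(3.0)
y = x1 * x2 + sin(x1)
backward(y)
print(x1.grad, x2.grad)  # df/dx1 = x2 + cos(x1), df/dx2 = x1
```

One backward sweep yields the derivative with respect to every input at once, which is why the cost does not grow with the number of inputs.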
A lot of papers in machine learning (including many papers I like) go like this:
1. Invent a new type of model or a new loss function.
2. Manually crank out the derivatives. (The main technical difficulty of the paper.)
3. Learn by plugging the derivatives into an optimization procedure. (Usually L-BFGS or stochastic gradient.)
4. Experimental results.
It is bizarre that the main technical contribution of so many papers seems to be something that computers can do for us automatically. We would be better off just considering autodiff part of the optimization procedure, and directly plugging in the objective function from step 1. In my opinion, the habit of deriving gradients by hand is actually harmful to the field. Before discussing why, let’s consider why autodiff is so little used:
1. People don’t know about it.
2. People know about it, but don’t use it because they want to pad their papers with technically formidable derivations of gradients.
3. People know about it, but don’t use it for some valid reasons I’m not aware of.
I can’t comment on (3) by definition. I think the answer is (1), though I’m not sure. Part of the problem is that “automatic differentiation” sounds like something you know, even if you actually have no idea what it is. I sometimes get funny looks from people when I claim that both of the following are true.
Further evidence for (1) is that many papers, after deriving their gradient, proclaim that “this algorithm for computing the gradient is the same order of complexity as evaluating the objective function,” seeming to imply it is fortunate that a fast gradient algorithm exists.
Now, why bother complaining about this? Are those manual derivatives actually hurting anything? A minor reason is that they are distracting. Important ideas get obscured by the details, and valuable paper space (often a lot of space) is taken up by the gradient derivation rather than put to more productive use. Similarly, valuable researcher time is wasted.
In my view a more significant downside is that the habit of manually deriving gradients restricts the community to only using computational structures we are capable of manually deriving gradients for. This is a huge restriction on the kinds of computational machinery that could be used, and the kinds of problems that could be addressed.
As a simple example, consider learning parameters for some type of image filtering algorithm. Imaging problems suffer from boundary effects: a filter response cannot be computed at the edge of an image, because the filter would need pixels that lie outside it. A common trick is to “flip” the image over the boundary to provide the nonexistent measurements. This poses no difficulty at all to autodiff, but borders on impossible to account for analytically. Hence, I suspect, people simply refrain from researching these types of algorithms. That is a real cost.
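As a rough illustration of the “flip” trick, here is a one-dimensional sketch of the forward computation only. The names `reflect_index` and `filter1d` are hypothetical, and the mirroring convention (the edge sample is not repeated) is an assumption; other conventions exist. Every step is an index lookup, a multiply, or an add, so an autodiff tool can differentiate the output with respect to the filter weights without any special handling of the boundary.

```python
def reflect_index(i, n):
    """Map any integer index into [0, n) by mirroring about the two edges.

    Assumed convention: the edge sample itself is not repeated,
    so for n = 5 the index -1 maps to 1 and the index 5 maps to 3.
    """
    if n == 1:
        return 0
    period = 2 * (n - 1)
    i %= period  # Python's modulo is non-negative for a positive period
    return i if i < n else period - i

def filter1d(signal, weights):
    """Apply an odd-length filter to every sample, mirroring at the boundaries."""
    n = len(signal)
    radius = len(weights) // 2
    return [
        sum(w * signal[reflect_index(i + k - radius, n)]
            for k, w in enumerate(weights))
        for i in range(n)
    ]

# Usage: a 3-tap summing filter over a short signal.
print(filter1d([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0]))
```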
- Often, derivatives are needed for analysis, not just to plug into an optimization routine. Of course, these derivatives should always remain.
- When derivatives are reasonably simple, it can be informative to see the equations. Oftentimes, however, an algorithm is derived for computing the gradient. In these cases, it would almost always be easier for someone trying to implement the method to install an autodiff tool than to try to implement the gradient algorithm.
Update: Answers to some questions:
Q) What’s the difference between autodiff and symbolic diff?
A) They are totally different. The biggest difference is that autodiff can differentiate algorithms, not just expressions. Consider the following code:
    function f(x)
        y = x;
        for i=1...100
            y = sin(x+y);
        return y
Automatic differentiation can differentiate that, easily, in the same time as the original code. Symbolic differentiation would lead to a huge expression that would take much more time to compute.
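As a sketch of what a reverse-mode tool effectively does with that loop, here is a hand-written Python version. The name `f_and_grad` and the parameter `n_iters` are my own, and this is not the output of any actual autodiff tool. The forward pass stores the intermediate values of y, and the reverse pass walks them backwards, accumulating the derivative with respect to x. Both passes are loops of the same length.

```python
import math

def f_and_grad(x, n_iters=100):
    """Return f(x) and df/dx for: y = x; repeat n_iters times: y = sin(x + y)."""
    # Forward pass: run the original loop, remembering every intermediate y.
    ys = [x]
    for _ in range(n_iters):
        ys.append(math.sin(x + ys[-1]))

    # Reverse pass: y_bar holds d(output)/d(y_i), x_bar accumulates d(output)/dx.
    y_bar = 1.0
    x_bar = 0.0
    for i in range(n_iters, 0, -1):
        # y_i = sin(x + y_{i-1}), so both inputs share the local derivative cos(...).
        contribution = math.cos(x + ys[i - 1]) * y_bar
        x_bar += contribution   # direct dependence of y_i on x
        y_bar = contribution    # dependence of y_i on y_{i-1}
    x_bar += y_bar              # the initial assignment y_0 = x
    return ys[-1], x_bar

value, grad = f_and_grad(0.5)
print(value, grad)
```

A symbolic approach would instead expand a hundred nested sines into one expression before differentiating it.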
Q) What about non-differentiable functions?
A) No problem, as long as the function is differentiable at the point where you compute the gradient.
Q) Why don’t you care about convexity?
A) I do care about convexity, of course.