Automatic differentiation is a classic numerical method that takes a program, and (with minimal programmer effort) computes the derivatives of that program. This is very useful because, when optimizing complex functions, a lot of time tends to get spent manually deriving and then writing code for derivatives. Some systems like cvx do a great job of recognizing simple problem types (linear programs, cone programs, etc.), but can’t handle arbitrary functions.

At present, automatic differentiation involves a degree of pain. Typically, the user needs to write C or C++ or Fortran code and augment the code with “taping” commands to get it to work. There is also usually a significant computational overhead.

In a just world, there would be no pain, and there would be no computational overhead. In a just world, reverse-mod autodiff would work like this:

**STEP 1** The user writes the function they want to optimize in a convenient high level language like Python.

This is easy, but slow, and provides no derivatives.

Consider a function that sums the elements in a matrix product. The user (apparently unaware that numpy provides matrix multiplication) writes:

def fun(A,B): rez = 0 for i in range(A.shape[0]): for j in range(B.shape[1]): rez += dot(A[i,:],B[:,j]) return rez

**STEP 2** The user feeds their high level function into a library. This library uses operator overloading magic to build an expression graph representation of the function. This expression graph is then used to generate efficient machine ode for both the function, and its derivatives.

The user merely should have to do something like

A = numpy.random.rand(3,5) B = numpy.random.rand(5,4) dfun = compile_fun(fun,A,B)

They could then compute the function and derivatives (all in compiled code) using

F,[dFdA,dFdB] = dfun(A,B)

Thus, the user gets everything, with out having to compromise: maximum performance, and maximum convenience.

Since this is how things *should* work, I spent some time this past summer writing a library that does. The above code is actually a working example, calling the `compile_fun`

method– the only method a consumer of the library needs to know about. This library comes tantalizingly close to my goals.

Another simple example, mixing matrices and scalars:

def fun(A,B,a): return sum(sum((A*a-B)**2)) a = 2. A = random.rand(15,22) B = random.rand(15,22) import etree dfun = etree.compile_fun(fun,A,B,a) F,[dFdA,dFdB,dFda] = dfun(A,B,a)

Consider the above matrix multiplication example, with 60×60 matrices. The library can generate the expression graph, transform that into a simple bytecode, than transform that bytecode into C code *acceptably* quickly– 11.97 seconds on my machine. The code then runs very fast: .0086 seconds as opposed to .0254 seconds for just running the original function or .0059 seconds for calling numpy’s matrix multiplication routine `dot(A,B)`

. (Which, of course, does not provide derivatives!)

There is, inevitably, one large problem: horrific compilation times on large functions. To take the C code and transform it into machine code, gcc takes **1116 seconds**. Why, you ask? The reason is because gcc is trying to compile a single **36.5 MB** function:

#include "math.h" void rock(double* A, double* D){ A[12800]=A[0] * A[6400]; A[12801]=A[1] * A[6480]; A[12802]=A[12800] + A[12801]; A[12803]=A[2] * A[6560]; A[12804]=A[12802] + A[12803]; A[12805]=A[3] * A[6640]; // thousands of pages of same D[12801] = D[12802]; D[1] += D[12801] * A[6480]; D[6480] += D[12801] * A[1]; D[0] += D[12800] * A[6400]; D[6400] += D[12800] * A[0]; }

Though this is very simple code, it seems that gcc still uses some super-linear algorithms, so it can’t handle large functions.

(To be clear, for small or medium sized problems, the code compiles very quickly.)

Now, clearly, I am doing something wrong. Frankly I am writing this largely in the hope that someone who actually knows something about compilers and programming languages can enlighten me.

1) Use a better compiler. I’ve tried this. I can turn off gcc’s optimization, which decreases compilation times, though to a still very slow level. Alternatively, I could use a fast compiler. tcc flies right through the code. Unfortunately, the machine code that tcc generates is very slow– I think due to poor register allocation.

2) Don’t generate intermediate C code– directly generate machine code or assembly. This would certainly work, but there is no way I am up to it. Maybe it would make sense to target the JVM?

3) Use one of the newfangled just in time virtual machine things, like LLVM or libJIT. This seems promising, but after quite a bit of research I’m still really unsure about how well this is likely to work, or how to proceed.

4) Write a fast interpreter, and just pass the bytecode to this. I’ve tried this as well. This is basically the strategy used by some autodiff packages, e.g. cppad. This can get reasonable performance, but will never compete with native code. The reason is– I think– that the interpreter is basically a giant switch statement, and all the branching screws up the pipeline on the CPU.

Anyway, the code is available as etree.py, and a tiny file mycode.h. (Let’s say it’s available under the MIT license, and the condition that you not laugh too much at my programming skills.) To try it out, just start python and do

import etree etree.runtests()

At this point, the library *is* useful (I use it!) but only for relatively small problems, containing no more than, say, a few thousand variables.

If anyone has ideas for better compiler flags, or knows about the feasibility of targeting LLVM or libJIT instead of C code, please get in touch.

Disclaimer: There are a bunch of minor issues with the library. These are all fixable, but it isn’t really worth the effort unless the compilation times can be dealt with.

- The function cannot branch. (No if statements).
- If you want to multiply a matrix by a scalar, the scalar must come after the matrix.
- I know it works on 32 bit ubuntu with everything installed via synaptic package manager, and on 64 bit OS X, simply using Sage (with this patch to make scipy/weave work). If you want to play around with this, but aren’t a python programmer, I would suggest using Sage.

You might be interested in theano: http://deeplearning.net/software/theano/

Thank you– theano looks very cool.

Autodiff plus automatic numerical hardening plus native code compilation, plus targeting the GPU? Sounds almost too good to be true…

Just a note: it’s unclear whether you’re using identities for derivatives of matrices directly, or treating arrays as sets of scalars which interact. There have been systems such as Coconut which work directly with matrix expressions and acheive good results. (The code is obviously not usable in your situation, but it’s may be an interesting approach.)

I treat the matrices as sets of interacting scalars. (This is what leads to the gigantic c source files.) Coconut looks very attractive (I’ve been interested in trying out C# or F# for a while, but was scared off by the lack of libraries), but I can’t actually find if the code is available online anywhere.

I’ve been thinking about this issue too, but the direction is a little bit different. My semi-conclusion is that we need some support from the underlying linear algebra library/language to cleanly deal with those branches/higher level algebraic manipulations(matrix mult.). Even if we can compile thousands of pages of …A[12801]=A[1] * A[6480];… stuff quickly, not sure it is an ideal approach, since the assembly code size can be really huge, depending on N.

Has anyone looked at automatic differentiation of randomized algorithms ? At some point these algorithms call a random number generator. It would interesting to track the first few moments of the derived differential. Just the mean and variance by itself will be very useful

You might also be interested in CasADi: casadi.org. It implements pretty much what you suggest above. It is mainly intended for numerical optimization. The AD implementation is very general and supports differentiation of algorithms consisting of sparse matrix-valued operations. The evaluation is done either in fast interpretors or in generated C-code. We’ve also experimented with generating LLVM, but gave up that approach.