Matrix Calculus

Based on a lot of requests from students, I did a lecture on matrix calculus in my machine learning class today. This was based on Minka’s Old and New Matrix Algebra Useful for Statistics and Magnus and Neudecker’s Matrix Differential Calculus with Applications in Statistics and Econometrics.

In making the notes, I used a couple innovations, which I am still debating the wisdom of. The first is the rule for calculating derivatives of scalar-valued functions of a matrix input f(X). Traditionally, this is written like so:

if dy = \text{tr}(A^T dX) then \frac{dy}{dX} = A.

I initially found the presence of the trace here baffling. However, there is the simple rule that

\text{tr}(A^T B) = A \cdot B

where \cdot is the matrix inner product. This puts the rule in the much more intuitive (to me!) form:

if dy = A \cdot dX then \frac{dy}{dX} = A.

This seems more straightforward, but it comes at a cost. When working with the rule in the trace form, one often needs to do quite a bit of shuffling around of matrices. This is easy to do using the standard trace identities like


If we are to work with inner-products, we will require a similar set of rules. It isn’t too hard to show that there are “dual” identities like

A \cdot (BC) = B \cdot (AC^T) = C \cdot (B^T A)

which allow a similar shuffling with dot products. Still, these are certainly less easy to remember.

There are also a set of other rules that seem to be needed in practice, but aren’t included in standard texts. For example, if R is a function that is applied elementwise to a matrix or vector (e.g. \sin), then

d(R(F)) = R'(F(X)) \odot dF

where \odot is the elementwise product. This then requires other (very simple) identities for getting rid of the elementwise product, such as

{\bf x}\odot{\bf y} = \text{diag}({\bf x}) {\bf y} = \text{diag}({\bf y}) {\bf x}.

Another issue with using dot products everywhere is the need to constantly convert between transposes and inner-products. (This issue comes up because I prefer a “all vectors are column vectors” convention) The never-ending debate of if we should write

{\bf x} \cdot {\bf y}


{\bf x}^T {\bf y}

seems to have particular importance here, and I’m not sure of the best choice.


Automatic Differentiation Without Compromises

Automatic differentiation is a classic numerical method that takes a program, and (with minimal programmer effort) computes the derivatives of that program. This is very useful because, when optimizing complex functions, a lot of time tends to get spent manually deriving and then writing code for derivatives. Some systems like cvx do a great job of recognizing simple problem types (linear programs, cone programs, etc.), but can’t handle arbitrary functions.

At present, automatic differentiation involves a degree of pain. Typically, the user needs to write C or C++ or Fortran code and augment the code with “taping” commands to get it to work. There is also usually a significant computational overhead.

In a just world, there would be no pain, and there would be no computational overhead. In a just world, reverse-mod autodiff would work like this:

STEP 1 The user writes the function they want to optimize in a convenient high level language like Python.

This is easy, but slow, and provides no derivatives.

Consider a function that sums the elements in a matrix product. The user (apparently unaware that numpy provides matrix multiplication) writes:

    def fun(A,B):
        rez = 0
        for i in range(A.shape[0]):
            for j in range(B.shape[1]):
                rez += dot(A[i,:],B[:,j])
        return rez

STEP 2 The user feeds their high level function into a library. This library uses operator overloading magic to build an expression graph representation of the function. This expression graph is then used to generate efficient machine ode for both the function, and its derivatives.

The user merely should have to do something like

    A = numpy.random.rand(3,5)
    B = numpy.random.rand(5,4)
    dfun = compile_fun(fun,A,B)

They could then compute the function and derivatives (all in compiled code) using

    F,[dFdA,dFdB] = dfun(A,B)

Thus, the user gets everything, with out having to compromise: maximum performance, and maximum convenience.

Since this is how things should work, I spent some time this past summer writing a library that does. The above code is actually a working example, calling the compile_fun method– the only method a consumer of the library needs to know about.  This library comes tantalizingly close to my goals.

Another simple example, mixing matrices and scalars:

def fun(A,B,a):
    return sum(sum((A*a-B)**2))
a = 2.
A = random.rand(15,22)
B = random.rand(15,22)
import etree
dfun = etree.compile_fun(fun,A,B,a)
F,[dFdA,dFdB,dFda] = dfun(A,B,a)

Consider the above matrix multiplication example, with 60×60 matrices. The library can generate the expression graph, transform that into a simple bytecode, than transform that bytecode into C code acceptably quickly– 11.97 seconds on my machine.  The code then runs very fast:  .0086 seconds as opposed to .0254 seconds for just running the original function or .0059 seconds for calling numpy’s matrix multiplication routine dot(A,B). (Which, of course, does not provide derivatives!)

There is, inevitably, one large problem:  horrific compilation times on large functions.  To take the C code and transform it into machine code, gcc takes 1116 seconds.  Why, you ask?  The reason is because gcc is trying to compile a single 36.5 MB function:

#include "math.h"
void rock(double* A, double* D){
A[12800]=A[0] * A[6400];
A[12801]=A[1] * A[6480];
A[12802]=A[12800] + A[12801];
A[12803]=A[2] * A[6560];
A[12804]=A[12802] + A[12803];
A[12805]=A[3] * A[6640];
// thousands of pages of same
D[12801] = D[12802];
D[1] += D[12801] * A[6480];
D[6480] += D[12801] * A[1];
D[0] += D[12800] * A[6400];
D[6400] += D[12800] * A[0];

Though this is very simple code, it seems that gcc still uses some super-linear algorithms, so it can’t handle large functions.

(To be clear, for small or medium sized problems, the code compiles very quickly.)

Now, clearly, I am doing something wrong. Frankly I am writing this largely in the hope that someone who actually knows something about compilers and programming languages can enlighten me.

1) Use a better compiler.  I’ve tried this.  I can turn off gcc’s optimization, which decreases compilation times, though to a still very slow level.  Alternatively, I could use a fast compiler.  tcc flies right through the code.  Unfortunately, the machine code that tcc generates is very slow– I think due to poor register allocation.

2) Don’t generate intermediate C code– directly generate machine code or assembly.  This would certainly work, but there is no way I am up to it. Maybe it would make sense to target the JVM?

3) Use one of the newfangled just in time virtual machine things, like LLVM or libJIT.  This seems promising, but after quite a bit of research I’m still really unsure about how well this is likely to work, or how to proceed.

4) Write a fast interpreter, and just pass the bytecode to this.  I’ve tried this as well.  This is basically the strategy used by some autodiff packages, e.g. cppad. This can get reasonable performance, but will never compete with native code. The reason is– I think– that the interpreter is basically a giant switch statement, and all the branching screws up the pipeline on the CPU.

Anyway, the code is available as, and a tiny file mycode.h. (Let’s say it’s available under the MIT license, and the condition that you not laugh too much at my programming skills.) To try it out, just start python and do

import etree

At this point, the library is useful (I use it!) but only for relatively small problems, containing no more than, say, a few thousand variables.

If anyone has ideas for better compiler flags, or knows about the feasibility of targeting LLVM or libJIT instead of C code, please get in touch.

Disclaimer: There are a bunch of minor issues with the library. These are all fixable, but it isn’t really worth the effort unless the compilation times can be dealt with.

  • The function cannot branch. (No if statements).
  • If you want to multiply a matrix by a scalar, the scalar must come after the matrix.
  • I know it works on 32 bit ubuntu with everything installed via synaptic package manager, and on 64 bit OS X, simply using Sage (with this patch to make scipy/weave work). If you want to play around with this, but aren’t a python programmer, I would suggest using Sage.

A simple explanation of reverse-mode automatic differentiation

My previous rant about automatic differentiation generated several requests for an explanation of how it works. This can be confusing because there are different types of automatic differentiation (forward-mode, reverse-mode, hybrids.) This is my attempt to explain the basic idea of reverse-mode autodiff as simply as possible.

Reverse-mode automatic differentiation is most attractive when you have a function that takes n inputs x_1,x_2,...,x_n, and produces a single output x_N. We want the derivatives of that function, \displaystyle{\frac{d x_N}{d x_i}}, for all i.

Point #1: Any differentiable algorithm can be translated into a sequence of assignments of basic operations.


for i=n+1,n+2,...,N

x_i \leftarrow f_i({\bf x}_{\pi(i)})

Here, each function f_i is some very basic operation (e.g. addition, multiplication, a logarithm) and \pi(i) denotes the set of “parents” of x_i. So, for example, if \pi(7)=(2,5) and f_7 = \text{add}, then x_7 = x_2 + x_5.

It would be extremely tedious, of course, to actually write an algorithm in this “expression graph” form. So, autodiff tools create this representation automatically from high-level source code.

Point #2: Given an algorithm in the previous format, it is easy to compute its derivatives.

The essential point here is just the application of the chain rule.

\displaystyle{ \frac{d x_N}{d x_i} = \sum_{j:i\in \pi(j)} \frac{d x_N}{d x_j}\frac{\partial x_j}{\partial x_i}}

Applying this, we can compute all the derivatives in reverse order.


\displaystyle{ \frac{d x_N}{d x_N} \leftarrow 1}

for i=N-1,N-2,...,1

\displaystyle{ \frac{d x_N}{d x_i} \leftarrow \sum_{j:i\in \pi(j)} \frac{d x_N}{d x_j}\frac{\partial f_j}{\partial x_i}}

That’s it!  Just create an expression graph representation of the algorithm and differentiate each basic operation f_i in reverse order using calc 101 rules.

Other stuff:

  • No, this is not the same thing as symbolic differentiation.  This should be obvious:  Most algorithms don’t even have simple symbolic representations.  And, even if yours does,  it is possible that it “explodes” upon symbolic differentiation.  As a contrived example, try computing the derivative of \exp(\exp(\exp(\exp(\exp(\exp(x)))))).
  • The complexity of the back-prop step is the same as the forward propagation step.
  • In machine learning, functions from N inputs to one output come up all the time:  The N inputs are parameters defining a model, and the 1 output is a loss, measuring how well the model fits training data.  The gradient can be fed into an optimization routine to fit the model to data.
  • There are two common ways of implementing this:
  1. Operator Overloading.  One can create a new variable type that has all the common operations of numeric types, which automatically creates an expression graph when the program is run.   One can then call the back-prop routine on this expression graph.  Hence,  one does not need to modify the program, just replace each numeric type with this new type.  This is fairly easy to implement, and very easy to use.  David Gay‘s RAD toolbox for C++ is a good example, which I use all the time.
    The major downside of operator overloading is efficiency:  current compilers will not optimize the backprop code.  Essentially, this  step is interpreted.  Thus, one finds in practice a non-negligible overhead of, say, 2-15 times the complexity of the original algorithm using a native numeric type.  (The overhead depends on how much the original code benefits from compiler optimizations.)
  2. Source code transformation. Alternatively, one could write a program that examines the source code of the original program, and transforms this into source code computing the derivatives.  This is much harder to implement, unless one is using a language like Lisp with very uniform syntax.  However, because the backprop source code produced is then optimized like normal code, it offers the potential of zero overhead compared with manually computed derivatives.
  • If it isn’t convenient to use automatic differentiation, one can also use “manual automatic differentation”.  That is, to compute the derivatives, just attack each intermediate value your algorithm computes, in reverse order.
  • Some of the most interesting work on autodiff comes from Pearlmutter and Siskind, who have produced a system called Stalingrad for a subset of scheme that allows for crazy things like taking derivatives of code that itself is taking derivates.  (So you can, for example, produce Hessians.)  I think they wouldn’t mind hearing from potential users.

More Vector Calculus Identities

I realize that “rule 2” from the previous post is actually just a special case of the vector chain rule.

Rule 4. (Chain rule)

If f({\bf x}) = {\bf g}({\bf h}({\bf x})), then

J[{\bf f}] = J[{\bf g}]J[{\bf h}], or equivalently,

\frac{ \partial{\bf f} }{ \partial{\bf x}^T} = \frac{ \partial{\bf g} }{ \partial{\bf h}^T} \frac{ \partial{\bf h} }{ \partial{\bf x}^T}.

Here, I have used {\bf h} to denote the argument of {\bf g}. (That makes it look more like the usual chain rule.)

From this, you get the special case where g is a scalar function. (I use the non-boldface g in g({\bf h}) to suggest that g is a scalar function that operates ‘element-wise’ on vector input.)

Rule 4. (Chain rule– special case for a scalar function)

If f({\bf x}) = g({\bf h}({\bf x})), then

J[{\bf f}]({\bf x}) = \text{diag}[ g'({\bf h})] J[{\bf h}], or equivalently,

J[{\bf f}]({\bf x}) = g'({\bf h}) {\bf 1}^T \odot J[{\bf h}].

In the last line, I use the fact that \text{diag}({\bf x})A = {\bf x}{\bf 1}^T \odot A.

Finally, substituting g = g' = \exp gives the special case below.

Vector Calculus Identities


g({\bf x}) = {\bf p}({\bf x})^T {\bf q}({\bf x}).

That is, g is a scalar function of a vector {\bf x}, given by taking the inner product of the two vector valued functions {\bf p} and {\bf q}. Now, we would like the gradient of g, i.e. \frac{\partial g}{\partial \bf x}. What is it?

I frequently need to find such derivatives, and I have never been able to find any reference for rules to calculate them. (Though such rules surely exist somewhere!) Today, Alap and I derived a couple simple rules. The first answers the above question.

Rule 1.

If g({\bf x}) = {\bf p}({\bf x})^T {\bf q}({\bf x}), then,

\frac{\partial g}{\partial \bf x} = J[{\bf p}]^T {\bf q}({\bf x}) + J[{\bf q}]^T{\bf p}({\bf x}).

Here, J[{\bf p}]=\frac{\partial {\bf p}}{\partial {\bf x}^T} is the Jacobian. i.e. J_{i,j} = \frac{\partial p_i }{\partial x_j}

This rule is a generalization of the calculus 101 product rule, where g(x)=p(x) q(x) (everything scalar), and g'(x) = p'(x) q(x) + q'(x) p(x).

A different rule concerning exponentials is

Rule 2.

If {\bf f}({\bf x}) = \exp({\bf q}({\bf x})), then

J[{\bf f}]= \frac{\partial {\bf f}}{\partial {\bf x}^T} = \exp( {\bf q}({\bf x})) {\bf 1}^T \odot J[{\bf q}].

Here, \odot is the element-wise product. The strange product with {\bf 1}^T can be understood as “copying” the first vector. That is, {\bf x} {\bf 1}^T is a matrix where each column consists of \bf x. (The number of columns must be understood from context.)

Rule 3.

If {\bf f}({\bf x}) = {\bf p}({\bf x}) \odot {\bf q}({\bf x}), then

J[{\bf f}] ={\bf p}({\bf x}){\bf 1}^T \odot J[{\bf q}] + J[{\bf p}] \odot {\bf q}({\bf x}){\bf 1}^T

Surely there exists a large set of rules like this somewhere, but I have not been able to find them, despite needing things like this for years now. What I would really like to do is use a computer algebra system, such as Maple, Mathematica or Sage to do this, but that doesn’t seem possible at present. (It is suggested in that thread that Mathematica is up to the job, but I have tested it out found that not to be true.)