Matrix Calculus

Based on a lot of requests from students, I did a lecture on matrix calculus in my machine learning class today. This was based on Minka’s Old and New Matrix Algebra Useful for Statistics and Magnus and Neudecker’s Matrix Differential Calculus with Applications in Statistics and Econometrics.

In making the notes, I used a couple innovations, which I am still debating the wisdom of. The first is the rule for calculating derivatives of scalar-valued functions of a matrix input f(X). Traditionally, this is written like so:

if dy = \text{tr}(A^T dX) then \frac{dy}{dX} = A.

I initially found the presence of the trace here baffling. However, there is the simple rule that

\text{tr}(A^T B) = A \cdot B

where \cdot is the matrix inner product. This puts the rule in the much more intuitive (to me!) form:

if dy = A \cdot dX then \frac{dy}{dX} = A.

This seems more straightforward, but it comes at a cost. When working with the rule in the trace form, one often needs to do quite a bit of shuffling around of matrices. This is easy to do using the standard trace identities like

\text{tr}(ABC)=\text{tr}(CAB).

If we are to work with inner-products, we will require a similar set of rules. It isn’t too hard to show that there are “dual” identities like

A \cdot (BC) = B \cdot (AC^T) = C \cdot (B^T A)

which allow a similar shuffling with dot products. Still, these are certainly less easy to remember.

There are also a set of other rules that seem to be needed in practice, but aren’t included in standard texts. For example, if R is a function that is applied elementwise to a matrix or vector (e.g. \sin), then

d(R(F)) = R'(F(X)) \odot dF

where \odot is the elementwise product. This then requires other (very simple) identities for getting rid of the elementwise product, such as

{\bf x}\odot{\bf y} = \text{diag}({\bf x}) {\bf y} = \text{diag}({\bf y}) {\bf x}.

Another issue with using dot products everywhere is the need to constantly convert between transposes and inner-products. (This issue comes up because I prefer a “all vectors are column vectors” convention) The never-ending debate of if we should write

{\bf x} \cdot {\bf y}

or

{\bf x}^T {\bf y}

seems to have particular importance here, and I’m not sure of the best choice.

Posted in Uncategorized | Tagged , , | 14 Comments

Automatic Differentiation Without Compromises

Automatic differentiation is a classic numerical method that takes a program, and (with minimal programmer effort) computes the derivatives of that program. This is very useful because, when optimizing complex functions, a lot of time tends to get spent manually deriving and then writing code for derivatives. Some systems like cvx do a great job of recognizing simple problem types (linear programs, cone programs, etc.), but can’t handle arbitrary functions.

At present, automatic differentiation involves a degree of pain. Typically, the user needs to write C or C++ or Fortran code and augment the code with “taping” commands to get it to work. There is also usually a significant computational overhead.

In a just world, there would be no pain, and there would be no computational overhead. In a just world, reverse-mod autodiff would work like this:

STEP 1 The user writes the function they want to optimize in a convenient high level language like Python.

This is easy, but slow, and provides no derivatives.

Consider a function that sums the elements in a matrix product. The user (apparently unaware that numpy provides matrix multiplication) writes:

    def fun(A,B):
        rez = 0
        for i in range(A.shape[0]):
            for j in range(B.shape[1]):
                rez += dot(A[i,:],B[:,j])
        return rez

STEP 2 The user feeds their high level function into a library. This library uses operator overloading magic to build an expression graph representation of the function. This expression graph is then used to generate efficient machine ode for both the function, and its derivatives.

The user merely should have to do something like

    A = numpy.random.rand(3,5)
    B = numpy.random.rand(5,4)
    dfun = compile_fun(fun,A,B)

They could then compute the function and derivatives (all in compiled code) using

    F,[dFdA,dFdB] = dfun(A,B)

Thus, the user gets everything, with out having to compromise: maximum performance, and maximum convenience.


Since this is how things should work, I spent some time this past summer writing a library that does. The above code is actually a working example, calling the compile_fun method– the only method a consumer of the library needs to know about.  This library comes tantalizingly close to my goals.

Another simple example, mixing matrices and scalars:

def fun(A,B,a):
    return sum(sum((A*a-B)**2))
a = 2.
A = random.rand(15,22)
B = random.rand(15,22)
import etree
dfun = etree.compile_fun(fun,A,B,a)
F,[dFdA,dFdB,dFda] = dfun(A,B,a)

Consider the above matrix multiplication example, with 60×60 matrices. The library can generate the expression graph, transform that into a simple bytecode, than transform that bytecode into C code acceptably quickly– 11.97 seconds on my machine.  The code then runs very fast:  .0086 seconds as opposed to .0254 seconds for just running the original function or .0059 seconds for calling numpy’s matrix multiplication routine dot(A,B). (Which, of course, does not provide derivatives!)

There is, inevitably, one large problem:  horrific compilation times on large functions.  To take the C code and transform it into machine code, gcc takes 1116 seconds.  Why, you ask?  The reason is because gcc is trying to compile a single 36.5 MB function:

#include "math.h"
void rock(double* A, double* D){
A[12800]=A[0] * A[6400];
A[12801]=A[1] * A[6480];
A[12802]=A[12800] + A[12801];
A[12803]=A[2] * A[6560];
A[12804]=A[12802] + A[12803];
A[12805]=A[3] * A[6640];
// thousands of pages of same
D[12801] = D[12802];
D[1] += D[12801] * A[6480];
D[6480] += D[12801] * A[1];
D[0] += D[12800] * A[6400];
D[6400] += D[12800] * A[0];
}

Though this is very simple code, it seems that gcc still uses some super-linear algorithms, so it can’t handle large functions.

(To be clear, for small or medium sized problems, the code compiles very quickly.)

Now, clearly, I am doing something wrong. Frankly I am writing this largely in the hope that someone who actually knows something about compilers and programming languages can enlighten me.

1) Use a better compiler.  I’ve tried this.  I can turn off gcc’s optimization, which decreases compilation times, though to a still very slow level.  Alternatively, I could use a fast compiler.  tcc flies right through the code.  Unfortunately, the machine code that tcc generates is very slow– I think due to poor register allocation.

2) Don’t generate intermediate C code– directly generate machine code or assembly.  This would certainly work, but there is no way I am up to it. Maybe it would make sense to target the JVM?

3) Use one of the newfangled just in time virtual machine things, like LLVM or libJIT.  This seems promising, but after quite a bit of research I’m still really unsure about how well this is likely to work, or how to proceed.

4) Write a fast interpreter, and just pass the bytecode to this.  I’ve tried this as well.  This is basically the strategy used by some autodiff packages, e.g. cppad. This can get reasonable performance, but will never compete with native code. The reason is– I think– that the interpreter is basically a giant switch statement, and all the branching screws up the pipeline on the CPU.

Anyway, the code is available as etree.py, and a tiny file mycode.h. (Let’s say it’s available under the MIT license, and the condition that you not laugh too much at my programming skills.) To try it out, just start python and do

import etree
etree.runtests()

At this point, the library is useful (I use it!) but only for relatively small problems, containing no more than, say, a few thousand variables.

If anyone has ideas for better compiler flags, or knows about the feasibility of targeting LLVM or libJIT instead of C code, please get in touch.

Disclaimer: There are a bunch of minor issues with the library. These are all fixable, but it isn’t really worth the effort unless the compilation times can be dealt with.

  • The function cannot branch. (No if statements).
  • If you want to multiply a matrix by a scalar, the scalar must come after the matrix.
  • I know it works on 32 bit ubuntu with everything installed via synaptic package manager, and on 64 bit OS X, simply using Sage (with this patch to make scipy/weave work). If you want to play around with this, but aren’t a python programmer, I would suggest using Sage.
Posted in Uncategorized | Tagged , , | 7 Comments

Notation is evil

Exhibit A:

We have the symbol \propto,with the interpretation

x \propto y \leftrightarrow x = c y for some number c,

but there doesn’t appear to exist a symbol (Here, I use a boxed question mark: \boxed{?} to denote the symbol I claim doesn’t exist) with the interpretation

x \boxed{?} y \leftrightarrow x = y + c for some number c.

This pains me.  People sometimes have to resort to writing something like

y = f(x)+\text{const} (1)

= g(x) + \text{const} (2)

where the constants are (in general) different on lines (1) and (2).”

Even worse (or maybe not?), sometimes people seem to leave exponents lying around when they otherwise wouldn’t, e.g. write

\exp(y) \propto \exp(f(x)) \propto \exp(g(x)).

Exhibit B:

We have no symbol meaning “normalized sum”.  How many thousands of times have you seen some variant of

y = \frac{1}{N}\sum_{x \in X} f(x)

where N=|X|“?

Why do we need to define N?  Can’t we just use another mystery symbol to write

y = \boxed{?}_{x \in X} f(x)?

In some situations you could use \text{mean}, but that doesn’t always really work and is rarely done.

Posted in Uncategorized | 3 Comments

Fitting an inference algorithm instead of a model

One recent trend seems to be the realization that one can get better performance by tuning a CRF (Conditional Random Field) to a particular inference algorithm. Basically, forget about the distribution that the CRF represents, and instead only care how accurate are the results that pop out of inference. An extreme example of this is the recent paper Learning Real-Time MRF Inference for Image Denoising by Adrian Barbu.

The basic idea is to fit a FoE (Field of Experts) image prior such that when one takes a very few gradient descent steps on a denoising posterior, the results are accurate. From the abstract:

We argue that through appropriate training, a MRF/CRF model can be trained to perform very well on a suboptimal inference algorithm. The model is trained together with a fast inference algorithm through an optimization of a loss function […] We apply the proposed method to an image denoising application […] with a 1-4 iteration gradient descent inference algorithm. […] the proposed training approach obtains an improved benchmark performance as well as a 1000-3000 times speedup compared to the Fields of Experts MRF.

The implausible-sounding 1000-fold speedup comes simply from using only 4 iterations of gradient descent rather than several thousand. (Incidentally, L-BFGS requires far fewer iterations for this problem.) The results are a bit better than the generative FoE model– that takes much more work for training and inference.

I have every confidence that this does work well, and similar strategies could probably be used to create fast inference models/algorithms for many different problems. My thesis was largely an attempt to do the same thing for marginal, rather than MAP inference.

The disturbing/embarrassing question, for me, is does this really have anything to do with probabilistic modeling any more? Formally speaking, a probability density is being fit, but I doubt it would transfer to, say, inpainting, or that samples from the density would look like natural images. The best interpretation of what is happening might be that one is simply fitting a big, nonlinear, black box function approximation.

It seems that the more effort we expend to wring the best performance out of a probabilistic model, the less “probabilistic” it is.

P.S. Some of my friends have invited me to never mention autodiff again, ever, but this is one of the many papers where I think the learning optimization would be made much easier/faster by using it.
Posted in Uncategorized | Tagged , , | 1 Comment

Bird

Posted in Uncategorized | Tagged , | 1 Comment

What Gauss-Seidel is Really Doing

I’ve been reading Alan Sokal’s lecture notes “Monte Carlo Methods in Statistical Mechanics: Foundations and New Algorithms” today. Once I learned to take the word “Hamiltonian” and mentally substitute “function to be minimized”, they are very clearly written.

Anyway, the notes give an explanation of the Gauss-Seidel iterative method for solving linear systems that is so clear, I feel a little cheated that it was never explained to me before. Let’s start with the typical (confusing!) explanation of Gauss-Seidel (as in, say, Wikipedia). You want to solve some linear system

A{\bf x}={\bf b},

for \bf x. What you do is decompose A into a diagonal matrix D, a strictly lower triangular matrix L, and a strictly upper triangular matrix U, like

A=L+D+U.

To solve this, you iterate like so:

{\bf x} \leftarrow (D+L)^{-1}({\bf b}-U{\bf x}).

This works when A is symmetric positive definite.

Now, what the hell is going on here? The first observation is that instead of inverting A, you can think of minimizing the function

\frac{1}{2} {\bf x}^T A {\bf x} - {\bf x}^T{\bf b}.

(It is easy to see that the global minimum of this is found when A{\bf x}={\bf b}.) Now, how to minimize this function? One natural way to approach things would be to just iterate through the coordinates of {\bf x}, minimizing the function over each one. Say we want to minimize with respect to x_i. So, we are going to make the change x_i \leftarrow x_i + d. Thus, we want to minimize (with respect to d), this thing:

\frac{1}{2} ({\bf x}+d {\bf e_i})^T A ({\bf x}+d {\bf e_i}) - ({\bf x}+d {\bf e_i})^T {\bf b}

Expanding this, taking the derivative w.r.t. d, and solving gives us (where {\bf e_i} is the unit vector in the ith direction)

d = \frac{{\bf e_i}^T({\bf b}-A{\bf x})}{{\bf e_i}^TA{\bf e_i}},

or, equivalently,

d = \frac{1}{A_{ii}}(b_i-A_{i*}{\bf x}).

So, taking the solution at one timestep, {\bf x}^t, and solving for the solution {\bf x}^{t+1} at the next timestep, we will have, for the first index,

x^{t+1}_1 = x^t_1 + \frac{1}{A_{11}}(b_1-A_{1*}x^t).

Meanwhile, for the second index,

x^{t+1}_2 = x^t_2 + \frac{1}{A_{22}}(b_2-A_{2*}(x^t-{\bf e_1} x^t_1 +{\bf e_1} x^{t+1}_1 )).

And, more generally

x^{t+1}_n = x^t_n + \frac{1}{A_{nn}}(b_n-A_{n*}(\sum_{i<n}{\bf e_i} x^{t+1}_i +\sum_{i\geq n}{\bf e_i} x^t_i )).

Now, all is a matter of cranking through the equations.

x^{t+1}_n = x^t_n + \frac{1}{A_{nn}}(b_n-\sum_{i<n}A_{n i} x^{t+1}_i - \sum_{i\geq n}A_{n i} x^t_i ))

D {\bf x}^{t+1} = D {\bf x}^t + {\bf b}- L x^{t+1} - (D+U) {\bf x}^t

(D+L) {\bf x}^{t+1} = {\bf b}-U {\bf x}^t

Which is exactly the iteration we started with.

The fact that this would work when A is symmetric positive definite is now obvious– that is when the objective function we started with is bounded below.

The understanding that Gauss-Seidel is just a convenient way to implement coordinate descent also helps to get intuition for the numerical properties of Gauss-Seidel (e.g. quickly removing "high frequency error", and very slowly removing "low frequency error").

Posted in Uncategorized | Tagged , , | 1 Comment

Numbers in bar graphs

I spent far too much time today figuring out how to put numbers in bar graphs. i.e. how to transform this:
graph_nonums

into this:

graph_nums

When reading papers, I often wish numbers were included like this, and so resolved a while ago to always do so myself.  Plotting the numbers allows someone else to replot them when doing some sort of a comparison.  (With out needing to implement your algorithm, or ask you for your doubtlessly long-lost data, or (ugh) physically measuring the lengths of your bars.)  However, now seeing how tedious this can be, I understand why it is rarely done.

In any case, I wrote a function add_bar_nums.m that you can run after doing a bar plot that will add the numbers like above.  There are options for rotating the text, changing the display format, etc.

Posted in Uncategorized | Tagged , , | 3 Comments