Matrix Calculus

Based on a lot of requests from students, I did a lecture on matrix calculus in my machine learning class today. This was based on Minka’s Old and New Matrix Algebra Useful for Statistics and Magnus and Neudecker’s Matrix Differential Calculus with Applications in Statistics and Econometrics.

In making the notes, I used a couple of innovations, whose wisdom I am still debating. The first is the rule for calculating derivatives of a scalar-valued function y = f(X) of a matrix input X. Traditionally, this is written like so:

if dy = \text{tr}(A^T dX) then \frac{dy}{dX} = A.

I initially found the presence of the trace here baffling. However, there is the simple rule that

\text{tr}(A^T B) = A \cdot B

where \cdot is the matrix inner product. This puts the rule in the much more intuitive (to me!) form:

if dy = A \cdot dX then \frac{dy}{dX} = A.
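Both the trace/inner-product identity and the derivative rule are easy to check numerically. The following is a minimal NumPy sketch, not something from the lecture notes: the helper names `inner` and `numerical_gradient` and the test function y = X \cdot X are my own choices for illustration. For that function, dy = 2X \cdot dX, so the rule predicts \frac{dy}{dX} = 2X.

```python
import numpy as np

def inner(P, Q):
    """Matrix inner product: the sum of elementwise products."""
    return np.sum(P * Q)

def numerical_gradient(f, X, eps=1e-6):
    """Central-difference estimate of dy/dX, one entry of X at a time."""
    G = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)

# tr(A^T B) agrees with the inner product A . B
A = rng.standard_normal((3, 2))
B = rng.standard_normal((3, 2))
trace_form = np.trace(A.T @ B)
inner_form = inner(A, B)

# Example function (an assumption for illustration): y = X . X, so dy = (2X) . dX
def f(X):
    return inner(X, X)

X = rng.standard_normal((3, 2))
predicted = 2 * X                      # the "A" read off from dy = A . dX
estimated = numerical_gradient(f, X)   # finite-difference gradient

print(np.isclose(trace_form, inner_form))
print(np.allclose(estimated, predicted, atol=1e-6))
```

The finite-difference check is only a sanity test; it says nothing about how to derive A, which is where the identities come in.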

This seems more straightforward, but it comes at a cost. When working with the rule in the trace form, one often needs to do quite a bit of shuffling around of matrices. This is easy to do using the standard trace identities like

\text{tr}(ABC) = \text{tr}(CAB) = \text{tr}(BCA).

If we are to work with inner products, we will require a similar set of rules. It isn’t too hard to show that there are “dual” identities like

A \cdot (BC) = B \cdot (AC^T) = C \cdot (B^T A)

which allow a similar shuffling with dot products. Still, these are certainly less easy to remember.
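Since these are harder to remember, a quick numerical check can help when in doubt. This is a small sketch with NumPy; the shapes (A is 3x5, B is 3x4, C is 4x5) and the helper `inner` are arbitrary choices of mine, picked only so that every product is well defined.

```python
import numpy as np

def inner(P, Q):
    """Matrix inner product: the sum of elementwise products."""
    return np.sum(P * Q)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 5))

first = inner(A, B @ C)      # A . (BC)
second = inner(B, A @ C.T)   # B . (A C^T)
third = inner(C, B.T @ A)    # C . (B^T A)

print(np.isclose(first, second), np.isclose(first, third))
```

Using non-square matrices of different sizes is deliberate: a misplaced transpose then fails with a shape error instead of silently giving a wrong number.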

There is also a set of other rules that seem to be needed in practice but aren’t included in standard texts. For example, if R is a function that is applied elementwise to a matrix or vector (e.g. \sin), then

d(R(F)) = R'(F) \odot dF

where \odot is the elementwise product and R' is the derivative of R, also applied elementwise. This then requires other (very simple) identities for getting rid of the elementwise product, such as

{\bf x}\odot{\bf y} = \text{diag}({\bf x}) {\bf y} = \text{diag}({\bf y}) {\bf x}.
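Both the elementwise rule and the diag identity can be sanity-checked with a few lines of NumPy. This is only a sketch under my own assumptions: R is taken to be \sin (so R' is \cos), and dF is a small random perturbation standing in for the differential.

```python
import numpy as np

rng = np.random.default_rng(0)

# Elementwise rule with R = sin: d(sin(F)) is approximately cos(F) * dF
F = rng.standard_normal((3, 3))
dF = 1e-6 * rng.standard_normal((3, 3))   # small perturbation (assumed stand-in for dF)
actual_change = np.sin(F + dF) - np.sin(F)
predicted_change = np.cos(F) * dF         # R'(F) elementwise-times dF

# Removing the elementwise product: x * y = diag(x) y = diag(y) x
x = rng.standard_normal(4)
y = rng.standard_normal(4)
hadamard = x * y
via_diag_x = np.diag(x) @ y
via_diag_y = np.diag(y) @ x

print(np.allclose(actual_change, predicted_change, atol=1e-10))
print(np.allclose(hadamard, via_diag_x), np.allclose(hadamard, via_diag_y))
```

The first comparison holds only up to second-order terms in dF, which is why it uses a tolerance rather than exact equality.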

Another issue with using dot products everywhere is the need to constantly convert between transposes and inner products. (This issue comes up because I prefer an “all vectors are column vectors” convention.) The never-ending debate over whether we should write

{\bf x} \cdot {\bf y}

or

{\bf x}^T {\bf y}

seems to have particular importance here, and I’m not sure of the best choice.
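The same tension shows up in code. As a small illustration (assuming NumPy and vectors stored explicitly as n-by-1 columns, which is my choice here, not something from the notes), the transpose form produces a 1-by-1 matrix while the inner-product form produces a scalar:

```python
import numpy as np

# Column vectors stored explicitly with shape (3, 1)
x = np.array([[1.0], [2.0], [3.0]])
y = np.array([[4.0], [5.0], [6.0]])

as_transpose = x.T @ y      # x^T y: a 1x1 matrix
as_inner = np.sum(x * y)    # x . y: a plain scalar

print(as_transpose.shape)   # (1, 1)
print(as_transpose[0, 0] == as_inner)
```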