Matrix Calculus

After many requests from students, I gave a lecture on matrix calculus in my machine learning class today. It was based on Minka’s “Old and New Matrix Algebra Useful for Statistics” and Magnus and Neudecker’s “Matrix Differential Calculus with Applications in Statistics and Econometrics”.

In making the notes, I used a couple of innovations whose wisdom I am still debating. The first concerns the rule for calculating derivatives of a scalar-valued function of a matrix input, y = f(X). Traditionally, this is written like so:

if dy = \text{tr}(A^T dX) then \frac{dy}{dX} = A.

I initially found the presence of the trace here baffling. However, there is the simple rule that

\text{tr}(A^T B) = A \cdot B

where \cdot is the matrix inner product. This puts the rule in the much more intuitive (to me!) form:

if dy = A \cdot dX then \frac{dy}{dX} = A.
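
As a rough numerical sanity check of the two statements above, here is a minimal NumPy sketch. The helper name `inner`, the example function f(X) = a^T X b, and the random test matrices are illustrative assumptions, not part of the lecture notes. For this f, dy = a^T dX b = (a b^T) \cdot dX, so the rule predicts \frac{dy}{dX} = a b^T.

```python
import numpy as np

def inner(A, B):
    """Matrix inner product A . B: the sum of elementwise products."""
    return float(np.sum(A * B))

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))

# tr(A^T B) and A . B should agree up to floating-point rounding.
trace_form = float(np.trace(A.T @ B))
inner_form = inner(A, B)

# Hypothetical example function: y = f(X) = a^T X b.
a = rng.standard_normal(3)
b = rng.standard_normal(4)

def f(X):
    return float(a @ X @ b)

# Gradient predicted by the rule "if dy = A . dX then dy/dX = A".
grad = np.outer(a, b)

# Compare against a finite difference along a random direction dX.
X = rng.standard_normal((3, 4))
dX = rng.standard_normal((3, 4))
eps = 1e-6
fd = (f(X + eps * dX) - f(X)) / eps
predicted = inner(grad, dX)

print(abs(trace_form - inner_form), abs(fd - predicted))
```

Both printed differences should be tiny. Note that the gradient has the same shape as X, which is the convention the inner-product form of the rule gives you.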

This seems more straightforward, but it comes at a cost. When working with the rule in the trace form, one often needs to do quite a bit of shuffling around of matrices. This is easy to do using the standard trace identities like

\text{tr}(ABC) = \text{tr}(CAB) = \text{tr}(BCA).

If we are to work with inner products, we will require a similar set of rules. It isn’t too hard to show that there are “dual” identities like

A \cdot (BC) = B \cdot (AC^T) = C \cdot (B^T A)

which allow a similar shuffling with dot products. Still, these are certainly less easy to remember.
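
One way to gain confidence in these dual identities is to check them numerically. The sketch below assumes shapes chosen so that every product is defined (A is 3x5, B is 3x4, C is 4x5); the helper `inner` and the random matrices are illustrative only. It also compares against \text{tr}(A^T B C), since A \cdot (BC) is by definition that trace.

```python
import numpy as np

def inner(A, B):
    """Matrix inner product A . B: the sum of elementwise products."""
    return float(np.sum(A * B))

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5))  # same shape as the product B C
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 5))

lhs = inner(A, B @ C)      # A . (BC)
mid = inner(B, A @ C.T)    # B . (A C^T)
rhs = inner(C, B.T @ A)    # C . (B^T A)

# The same quantity written in trace form.
trace_val = float(np.trace(A.T @ B @ C))

print(lhs, mid, rhs, trace_val)
```

All four printed values should match up to rounding. A useful mnemonic: a factor moved across the dot from the left of a product arrives transposed on the left, and a factor moved from the right arrives transposed on the right.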

There is also a set of other rules that seem to be needed in practice but aren’t included in standard texts. For example, if R is a function that is applied elementwise to a matrix or vector (e.g. \sin), then

d(R(F)) = R'(F) \odot dF

where \odot is the elementwise product. This then requires other (very simple) identities for getting rid of the elementwise product, such as

{\bf x}\odot{\bf y} = \text{diag}({\bf x}) {\bf y} = \text{diag}({\bf y}) {\bf x}.
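
To see these rules working together, here is a small sketch on a made-up example, y = {\bf c} \cdot \sin(A {\bf x}); the function `g` and the symbols A and {\bf c} are assumptions for illustration. Applying the elementwise rule, then the diag identity, then the dual inner-product identity gives dy = {\bf c} \cdot (\cos(A{\bf x}) \odot A\, d{\bf x}) = (A^T ({\bf c} \odot \cos(A{\bf x}))) \cdot d{\bf x}, which the code compares against a finite difference.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(4)
y = rng.standard_normal(4)

# x (elementwise) y = diag(x) y = diag(y) x
hadamard = x * y
via_diag_x = np.diag(x) @ y
via_diag_y = np.diag(y) @ x

# Hypothetical example: y = c . sin(A x), with sin applied elementwise.
A = rng.standard_normal((3, 4))
c = rng.standard_normal(3)

def g(x):
    return float(c @ np.sin(A @ x))

# Gradient obtained from the rules: A^T (c elementwise-times cos(A x)).
grad = A.T @ (c * np.cos(A @ x))

# Central finite difference along a random direction dx.
dx = rng.standard_normal(4)
eps = 1e-6
fd = (g(x + eps * dx) - g(x - eps * dx)) / (2 * eps)
predicted = float(grad @ dx)

print(np.max(np.abs(hadamard - via_diag_x)), abs(fd - predicted))
```

In real code one would never form diag(x) explicitly; the identity is only useful on paper, for turning an elementwise product into an ordinary matrix product that the other rules can handle.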

Another issue with using dot products everywhere is the need to constantly convert between transposes and inner products. (This issue comes up because I prefer an “all vectors are column vectors” convention.) The never-ending debate over whether we should write

{\bf x} \cdot {\bf y}

or

{\bf x}^T {\bf y}

seems to have particular importance here, and I’m not sure of the best choice.


11 Responses to Matrix Calculus

  1. ben says:

    Strang’s “Linear Algebra and Its Applications”, Golub & Van Loan’s “Matrix Computations”, Trefethen’s “Numerical Linear Algebra”, and Boyd’s “Convex Optimization” all share the x^T y approach, and that is what I prefer, at least from my engineering background. When we write the product of two matrices A*B, it seems like the two notations have opposite meanings, one being A*B and the other being A^T*B. The x^T y notation seems to keep track of the dimensions for you. I think sometimes the push for smaller notation obscures the mechanics of things.

  2. Check out “Some Eclectic Matrix Theory” by Kenneth S. Miller (1987, Robert E. Krieger Publishing Company). It has a fantastic formalism for matrix calculus, derived (hah hah, sometimes I just crack myself up) at breakneck speed in a very short chapter. Basically, the big trick is to use the Kronecker Product to define dM/dX (where M is a matrix function of the matrix X) and then all the rules come out pretty simple, unlike the horrible things that happen with vec().

  3. justindomke says:

    Thanks! Sounds great. I just ordered it.

    BTW, I totally agree about vec(). My conclusion when reading Magnus and Neudecker was that if this is what is needed to calculate matrix/matrix derivatives, well, I just won’t calculate any matrix/matrix derivatives.

  4. BTW, Minka’s is the transpose of yours: tr(A dX) means d/dX is A, hence the differential A.dX implies d/dX = A’. This means that the shape of d/dY is the transpose of the shape of Y. I like this choice because then the chain rule works without modification.

  5. BTW, the “block matrix” oriented notation of Miller (section 13) is addressed in Magnus as an example of bad notation (Definition 3 in Magnus 9.3). One problem is that arranging partial derivatives this way loses standard properties of Jacobian matrix — rank and determinant of this matrix are meaningless and the chain rule doesn’t work. Their solution is to flatten the function and argument using vec, and then use standard definition of vector-valued function of vector-valued argument. This also simplifies some algebra, for instance, in this notation, derivative of matrix inverse f(X) is -f(X’) x f(X) where x is Kronecker product (eq.19 on p.183 Magnus). Using Miller’s notation it’s (-f(X) x I)(f(X) x I) (eq 14 on p.62 Miller)

  6. Naturally, I think that the Magnus comment on Miller’s “block matrix” notation is quite misguided. A matrix is not just a vector of numbers: it is also a representation of a linear function. Just stringing the numbers out into a vector loses this. If you want to treat the input and output to a function as vectors rather than matrices, pre/post-compose with vec^{-1} and vec. The Miller notation handles this just fine, and is therefore strictly more powerful than the flatten-it-all notation advocated in Magnus.

  7. The issue is that the derivative is not just an array of numbers; it is also a linear function. The matrix you get out of Magnus’ notation corresponds precisely to this linear function, and you can multiply corresponding derivatives to get the derivative of the composition. How would you do the chain rule using Miller’s notation?

  8. The relevant (and extremely short) chapter in Miller exhibits the appropriate chain rule, which is in fact quite succinct.

  9. Also: “the matrix you get out of Magnus’ notation corresponds precisely to this linear function”. This is true, with a caveat: in a particular basis. The same statement is true of Miller’s notation, but with a different basis.

  10. This particular basis is what turns composition into regular matrix multiplication. I could not find the chain rule in Miller’s chapter. I’m assuming you meant chapter 13, which gives derivatives for matrix products and the inverse.

  11. Daniel says:

    There shouldn’t really be a debate about whether to use x^Ty or x.y; there are clear benefits to swapping between them on the fly. The former permits use of certain distributive laws that would be more obvious to a student used to dealing with matrix and vector manipulation; the latter is useful when we choose to apply the induced inner product of a reproducing kernel Hilbert space, K(x,y).
