Hessian-Vector products

You have some function f({\bf x}).  You have figured out how to compute it’s gradient, g({\bf x})=\frac{\partial}{\partial \bf x}f({\bf x}).  Now, however, you find that you are implementing some algorithm (like, say, Stochastic Meta Descent), and you need to compute the product of the Hessian H({\bf x})=\frac{\partial^2}{\partial {\bf x}\partial{\bf x}^T}f({\bf x}) with certain vectors.  You become very upset, because either A) you don’t feel like deriving the Hessian (probable), or B) the Hessian has N^2 elements, where N is the length of \bf x and that is too big to deal with (more probable).  What to do?  Behold:

{\bf g}({\bf x}+{\bf \Delta x}) \approx {\bf g}({\bf x}) + H({\bf x}){\bf \Delta x}

Consider the Hessian-Vector product we want to compute, H({\bf x}){\bf v}.  For small r,

{\bf g}({\bf x}+r{\bf v}) \approx {\bf g}({\bf x}) + r H({\bf x}){\bf v}

And so,

\boxed{H({\bf x}){\bf v}\approx\frac{{\bf g}({\bf x}+r{\bf v}) - {\bf g}({\bf x})}{r}}

This trick above has been around apparently forever.  The approximation becomes exact in the limit r \rightarrow 0.  Of course, for small r, numerical problems will also start to kill you.  Pearlmutter’s algorithm is a way to compute H {\bf v} with the same complexity, with out suffering rounding errors.  Unfortunately, Pearlmutter’s algorithm is kind of complex, while the above is absolutely trivial.

Update: Perlmutter himself comments below that if we want to use the finite difference trick, we would do better to use the approximation:

\boxed{H({\bf x}){\bf v}\approx\frac{{\bf g}({\bf x}+r{\bf v}) - {\bf g}({\bf x}-r{\bf v})}{2r}}.

This expression will be closer to the true value for larger r, meaning that we are less likely to get hurt by rounding error. This is nicely illustrated on page 6 of these notes.

10 thoughts on “Hessian-Vector products

  1. Thanks for the cite. Couple nits.

    (a) If you’re going to use finite differences, you want (g(x+rv)-g(x-rv)) / 2r, to reduce discretization error from O(r) to O(r^2). This does not solve the “both sins of numeric analysis in one expression” issue, of course.

    (b) If you have AD available, you can convert g() to Hv() using a forward-mode AD xform. Most current high-performance AD tools won’t do nested applications though, and if you got g() from f() by applying reverse-mode AD to f() then going from g() to Hv() is a nested application.

    (c) This trick was known before the paper you cite (see refs therein, e.g., Werbos) so although I’m flattered…

    Jeff Siskind and I have been working on a super-aggressive AD-enabled optimizing compiler that can handle nesting. You’re welcome to play with it (on web, GPLed, “stalingrad” compiler for “VLAD” language), and we’d love any feedback, but it is a research prototype without amenities.

  2. Thanks for the response. I’ve updated the post to mention that one should really used centered differences. (I should know better than that, really.)

    I couldn’t quite figure out where the finite difference trick originated. I meant to imply in the post that it came from the misty past somewhere.

    I’ve been very interested in your stalingrad project. It seems like the only game in town, in terms of joint autodiff + optimizing compiler. I tried downloading the code and running the examples a few months ago. One inevitable comment is that the project could use more documentation. I was able to figure out, by searching around newsgroup postings and the like, that: 1) stalin is a fast scheme-to-c compiler, 2) vlad is the name of a new language closely related to scheme , and 3) stalingrad is an implementation of vlad, but this took some effort. I haven’t been able to find any documentation for vlad, aside from the brief description on p. 17 of the paper. Being able to write schemeish code, and get fortran speed with free nested application of AD is *very* attractive!

  3. The trick is used to calculate the product of Hv, but in the weight update equation we need H^-1v.

    Am I wrong somewhere?

  4. In what weight update equation do you mean? For Newton’s method, sure, you’d want inv(H)v and the trick above can only be useful for some kind of iterative method to solve the system. On the other hand, with Stochastic Meta Descent (which was one motivation for this):


    You do indeed just want H*v. Many other applications are similar.

  5. If central difference works, why not using complex step? Indeed, when r is really small, you will hit a round-off error due to0/0. While complex step is a way to resolve it perfectly.
    That is the best way to compute a gradient numerically and without much rounding off error.

  6. Sure, a complex step is much superior numerically. However, this may not be convenient if the code cannot (easily) support complex numbers– this is particularly common when the function you are working with is implemented on specialized hardware.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s