Hessian-Vector products

You have some function f({\bf x}).  You have figured out how to compute it’s gradient, g({\bf x})=\frac{\partial}{\partial \bf x}f({\bf x}).  Now, however, you find that you are implementing some algorithm (like, say, Stochastic Meta Descent), and you need to compute the product of the Hessian H({\bf x})=\frac{\partial^2}{\partial {\bf x}\partial{\bf x}^T}f({\bf x}) with certain vectors.  You become very upset, because either A) you don’t feel like deriving the Hessian (probable), or B) the Hessian has N^2 elements, where N is the length of \bf x and that is too big to deal with (more probable).  What to do?  Behold:

{\bf g}({\bf x}+{\bf \Delta x}) \approx {\bf g}({\bf x}) + H({\bf x}){\bf \Delta x}

Consider the Hessian-Vector product we want to compute, H({\bf x}){\bf v}.  For small r,

{\bf g}({\bf x}+r{\bf v}) \approx {\bf g}({\bf x}) + r H({\bf x}){\bf v}

And so,

\boxed{H({\bf x}){\bf v}\approx\frac{{\bf g}({\bf x}+r{\bf v}) - {\bf g}({\bf x})}{r}}

This trick above has been around apparently forever.  The approximation becomes exact in the limit r \rightarrow 0.  Of course, for small r, numerical problems will also start to kill you.  Pearlmutter’s algorithm is a way to compute H {\bf v} with the same complexity, with out suffering rounding errors.  Unfortunately, Pearlmutter’s algorithm is kind of complex, while the above is absolutely trivial.

Update: Perlmutter himself comments below that if we want to use the finite difference trick, we would do better to use the approximation:

\boxed{H({\bf x}){\bf v}\approx\frac{{\bf g}({\bf x}+r{\bf v}) - {\bf g}({\bf x}-r{\bf v})}{2r}}.

This expression will be closer to the true value for larger r, meaning that we are less likely to get hurt by rounding error. This is nicely illustrated on page 6 of these notes.