# Hessian-Vector products

You have some function $f({\bf x})$.  You have figured out how to compute its gradient, $g({\bf x})=\frac{\partial}{\partial \bf x}f({\bf x})$.  Now, however, you find that you are implementing some algorithm (like, say, Stochastic Meta Descent), and you need to compute the product of the Hessian $H({\bf x})=\frac{\partial^2}{\partial {\bf x}\partial{\bf x}^T}f({\bf x})$ with certain vectors.  You become very upset, because either A) you don't feel like deriving the Hessian (probable), or B) the Hessian has $N^2$ elements, where $N$ is the length of $\bf x$, and that is too big to deal with (more probable).  What to do?  Behold:

${\bf g}({\bf x}+{\bf \Delta x}) \approx {\bf g}({\bf x}) + H({\bf x}){\bf \Delta x}$

Consider the Hessian-Vector product we want to compute, $H({\bf x}){\bf v}$.  For small $r$,

${\bf g}({\bf x}+r{\bf v}) \approx {\bf g}({\bf x}) + r H({\bf x}){\bf v}$

And so,

$\boxed{H({\bf x}){\bf v}\approx\frac{{\bf g}({\bf x}+r{\bf v}) - {\bf g}({\bf x})}{r}}$
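As a minimal sketch of the forward-difference formula above (the function, matrix, and step size here are just illustrative choices, not anything canonical):

```python
import numpy as np

def hvp_forward(grad, x, v, r=1e-5):
    """Approximate H(x) @ v as (g(x + r*v) - g(x)) / r."""
    return (grad(x + r * v) - grad(x)) / r

# Sanity check on a quadratic: f(x) = 0.5 x^T A x, so g(x) = A x
# and H(x) = A exactly (the finite difference is exact up to rounding).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
grad = lambda x: A @ x
x = np.array([1.0, -1.0])
v = np.array([0.5, 2.0])

print(hvp_forward(grad, x, v))  # close to A @ v = [3.0, 6.5]
```

Note that this costs only two gradient evaluations, regardless of $N$, and never forms the $N \times N$ Hessian.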

The trick above has apparently been around forever.  The approximation becomes exact in the limit $r \rightarrow 0$.  Of course, for small $r$, numerical problems will also start to kill you.  Pearlmutter's algorithm is a way to compute $H {\bf v}$ exactly, with the same complexity, without suffering rounding errors.  Unfortunately, Pearlmutter's algorithm is kind of complex, while the above is absolutely trivial.

Update: Pearlmutter himself comments below that if we want to use the finite difference trick, we would do better to use the approximation:

$\boxed{H({\bf x}){\bf v}\approx\frac{{\bf g}({\bf x}+r{\bf v}) - {\bf g}({\bf x}-r{\bf v})}{2r}}.$

This central difference is closer to the true value for a given $r$ (its error is $O(r^2)$ rather than $O(r)$), so we can use a larger $r$ and are less likely to get hurt by rounding error. This is nicely illustrated on page 6 of these notes.
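A quick numerical illustration of why the central difference wins (the cubic test function here is just a convenient example with a non-constant Hessian):

```python
import numpy as np

def hvp_central(grad, x, v, r=1e-3):
    """Approximate H(x) @ v as (g(x + r*v) - g(x - r*v)) / (2r)."""
    return (grad(x + r * v) - grad(x - r * v)) / (2 * r)

def hvp_forward(grad, x, v, r=1e-3):
    """One-sided version, for comparison."""
    return (grad(x + r * v) - grad(x)) / r

# f(x) = sum(x^4)/4, so g(x) = x^3 and H(x) = diag(3 x^2).
grad = lambda x: x ** 3
x = np.array([1.0, 2.0])
v = np.array([1.0, 1.0])
exact = 3 * x ** 2 * v

r = 1e-3
err_forward = np.max(np.abs(hvp_forward(grad, x, v, r) - exact))
err_central = np.max(np.abs(hvp_central(grad, x, v, r) - exact))
# The forward error shrinks like r; the central error like r^2,
# so at the same r the central difference is far more accurate.
print(err_forward, err_central)
```

At $r = 10^{-3}$ the forward error here is on the order of $10^{-3}$ while the central error is on the order of $10^{-6}$, matching the $O(r)$ versus $O(r^2)$ behavior, at the cost of one extra gradient evaluation.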