You have some function . You have figured out how to compute it’s gradient,
. Now, however, you find that you are implementing some algorithm (like, say, Stochastic Meta Descent), and you need to compute the product of the Hessian
with certain vectors. You become very upset, because either A) you don’t feel like deriving the Hessian (probable), or B) the Hessian has
elements, where
is the length of
and that is too big to deal with (more probable). What to do? Behold:
Consider the Hessian-Vector product we want to compute, . For small
,
And so,
This trick above has been around apparently forever. The approximation becomes exact in the limit . Of course, for small
, numerical problems will also start to kill you. Pearlmutter’s algorithm is a way to compute
with the same complexity, with out suffering rounding errors. Unfortunately, Pearlmutter’s algorithm is kind of complex, while the above is absolutely trivial.
Update: Perlmutter himself comments below that if we want to use the finite difference trick, we would do better to use the approximation:
This expression will be closer to the true value for larger , meaning that we are less likely to get hurt by rounding error. This is nicely illustrated on page 6 of these notes.