Stochastic Meta-Descent

As ever, \odot is the elementwise product. Here w_t are the weights, g_t is the stochastic gradient, p_t is a vector of per-parameter step sizes, \mu is the meta-learning rate, \lambda \in [0,1] is a decay factor, H_t is the Hessian, and v_t is an auxiliary vector tracking how the weights depend on the step sizes.

w_{t+1}=w_t-p_t \odot g_t

p_t = p_{t-1}\odot \exp(\mu v_t \odot g_t)

v_{t+1}=\lambda v_t + p_t \odot(g_t - \lambda H_t v_t)
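
A minimal sketch of one SMD iteration in NumPy, following the three updates above. The callables grad and hvp, the parameter names, and the default constants are illustrative assumptions, not something the post specifies:

    import numpy as np

    def smd_step(w, p, v, grad, hvp, mu=1e-3, lam=0.99):
        # w: weights; p: per-parameter step sizes; v: auxiliary vector.
        # grad(w): stochastic gradient g_t; hvp(w, v): Hessian-vector product H_t v_t.
        g = grad(w)
        # p_t = p_{t-1} * exp(mu * v_t * g_t), all products elementwise.
        p = p * np.exp(mu * v * g)
        # w_{t+1} = w_t - p_t * g_t
        w_next = w - p * g
        # v_{t+1} = lam * v_t + p_t * (g_t - lam * H_t v_t)
        v_next = lam * v + p * (g - lam * hvp(w, v))
        return w_next, p, v_next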

The Hessian-vector product H_t v_t can be approximated efficiently using only gradient evaluations, e.g. by a finite difference of gradients along v_t (it can also be computed exactly at similar cost with Pearlmutter's R-operator).
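
For example, a forward finite difference of the gradient along v_t gives H_t v_t \approx (g(w_t + \epsilon v_t) - g(w_t)) / \epsilon. A sketch; the fixed \epsilon here is a placeholder, and in practice it is often scaled to the magnitudes of w and v:

    import numpy as np

    def hvp_fd(grad, w, v, eps=1e-6):
        # Approximate H(w) @ v with one extra gradient evaluation:
        # (grad(w + eps*v) - grad(w)) / eps.
        return (grad(w + eps * v) - grad(w)) / eps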

One Response to Stochastic Meta-Descent

  1. twolfe18 says:

    Schraudolph says in a lecture that you usually want to approximate exp(x) with max(1/2, 1 + x) when updating p. This function 1) is much faster on the CPU (important, since it sits in an inner loop), 2) is a good approximation when \mu is small, and 3) is robust in the case where x happens to get really big. See http://videolectures.net/mlss06au_schraudolph_aml/ (part 2).
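
For reference, the comment's clipped linearization as a drop-in replacement for the exp in the gain update sketched above (variable names again my own):

    import numpy as np

    def gain_update(p, v, g, mu=1e-3):
        # max(1/2, 1 + x) ~ exp(x) for small x, grows only linearly for
        # large positive x, and is floored at 1/2 for very negative x.
        return p * np.maximum(0.5, 1.0 + mu * v * g)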
