This is a very simple trick I have had to rederive many times. Often you need to optimize over some parameters that need to be constrained to be positive. One solution is to reparameterize with some parameters , where the correspondence is

.

An unconstrained optimization over is equivalent to a constrained optimization over . Now, if the function we are optimizing is , we have a trivial correspondence between gradients:

(Notation: is the elementwise product, and is a vector of the derivatives .)

If we were to do gradient descent on , the updates would look like

.

But, why not forget about , and do this same update directly on ?

So the final update formula to remember is

Advertisements

The quick, straightforward explanation is appreciated.

I see a lot of literature on the Exponentiated Gradient (EG) algorithm, which is essentially this except the dx/d\theta term is left out — so you’re just exponentiating the gradient (times learning rate) and applying it as a multiplicative update. This is no longer strictly gradient descent with respect to \theta = log(x), but sounds like it still has some nice properties.

Anyone have any intuition about when EG makes sense vs this kind of “log-domain gradient descent”?