This is a very simple trick I have had to rederive many times. Often you need to optimize over some parameters $\bf x$ that need to be constrained to be positive. One solution is to reparameterize with some parameters $\boldsymbol \theta$, where the correspondence is

${\bf x} = \exp({\boldsymbol \theta})$.

An unconstrained optimization over $\boldsymbol \theta$ is equivalent to a constrained optimization over $\bf x$. Now, if the function we are optimizing is $f$, we have a trivial correspondence between gradients:

$\frac{df}{d{\boldsymbol \theta}}=\frac{df}{d{\bf x}}\odot\frac{d\bf x}{d\boldsymbol\theta}$

(Notation: $\odot$ is the elementwise product, and $\frac{d\bf x}{d\boldsymbol\theta}$ is a vector of the derivatives $dx_i/d\theta_i$.)

If we were to do gradient descent on $\boldsymbol \theta$, the updates would look like

${\boldsymbol \theta}' = {\boldsymbol \theta} - \lambda \frac{df}{d{\boldsymbol \theta}}$.

But, why not forget about ${\boldsymbol \theta}$, and do this same update directly on $\bf x$?

${\bf x}' = \exp({\boldsymbol \theta}') = \exp({\boldsymbol \theta}) \odot \exp(- \lambda \frac{df}{d{\boldsymbol \theta}})$

So the final update formula to remember is
$\boxed{{\bf x}'= {\bf x} \odot \exp(- \lambda {\bf x} \odot \frac{df}{d\bf x})}$

2 Responses to Log Gradient Descent

1. Josh says:

The quick, straightforward explanation is appreciated.

2. Matthew W says:

I see a lot of literature on the Exponentiated Gradient (EG) algorithm, which is essentially this except the dx/d\theta term is left out — so you’re just exponentiating the gradient (times learning rate) and applying it as a multiplicative update. This is no longer strictly gradient descent with respect to \theta = log(x), but sounds like it still has some nice properties.

Anyone have any intuition about when EG makes sense vs this kind of “log-domain gradient descent”?