Stochastic Approximation

“Stochastic approximation” is an oddly named field that I suspect will be of increasing interest to the machine learning community. I would prefer a name like “stochastic optimization” (which appears to be taken by several other areas already).

The basic idea of stochastic approximation is that we would like to optimize some function f({\bf x}). However, we only have noisy access to f. The type of problem depends on exactly what we can measure.

Robbins-Monro

In a Robbins-Monro problem, we have access to unbiased estimates of the gradient of f. That is, for any {\bf x}, we can generate (by performing an experiment, or running a simulation, or whatever) a random vector {\bf g}, where the expectation of \bf g is the true gradient of f.

E[{\bf g}({\bf x})] = \nabla f({\bf x})

The basic result that Robbins and Monro came up with in 1951 is that f can be optimized by taking decreasing steps along the estimated gradients. The resulting stochastic gradient descent algorithm is:

1. For t=1,2,...

2. Get a noisy, unbiased estimate of the gradient: {\bf g}_t \approx \nabla f({\bf x}_t)

3. Step along the estimated gradient: {\bf x}_{t+1} = {\bf x}_t - \lambda_t {\bf g}_t

The step sizes \lambda_t need to be a sequence satisfying

1) \sum_{t=1}^{\infty} \lambda_t = \infty

2) \lim_{t\rightarrow\infty} \lambda_t = 0.

(i.e. the step sizes go to zero, but there is always an “infinite amount of step” left). Strictly speaking, the classical analysis requires a bit more than \lambda_t \rightarrow 0, namely \sum_{t=1}^{\infty} \lambda_t^2 < \infty, so that the accumulated noise stays bounded. The typical sequence satisfying all of this is \lambda_t = a/(b+t) for some constants a and b.
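As a concrete illustration, here is a minimal Python sketch of the Robbins-Monro iteration. The quadratic objective, the Gaussian noise model, and the constants are all my own choices, made just to keep the example self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(x):
    """Unbiased estimate of the gradient of f(x) = 0.5 * ||x||^2.
    The true gradient is x itself; we add zero-mean Gaussian noise."""
    return x + rng.normal(scale=1.0, size=x.shape)

x = np.array([5.0, -3.0])
a, b = 1.0, 10.0                 # constants in lambda_t = a / (b + t)
for t in range(1, 10001):
    lam = a / (b + t)            # decreasing steps, infinite total step
    x = x - lam * noisy_grad(x)

print(x)                         # should be close to the minimum at 0
```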

It is critical to note that the above procedure is guaranteed to find a local minimum of f only asymptotically. I think it is underemphasized in the stochastic approximation literature that this is a weak guarantee, both in theory and in practice, since any real application can take only a finite number of steps. Tuning the constants a and b well is critical, and often difficult. It is very common to simply disregard the theory and use a small fixed step size \lambda_t = a, again suggesting that the asymptotic theory should not be taken too seriously. With a fixed step size, one finds in practice that {\bf x} moves close to a minimum, and then bounces around forever, with the size of the region it bounces around in depending on the variance of \bf g. If you can do anything to reduce the variance of \bf g, do it! (See common random numbers, below, someday.)
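To see this bouncing concretely, one can continue the sketch above (reusing numpy, rng, and noisy_grad) with a fixed step size; the setup is again purely illustrative:

```python
# Continuing the toy example above, now with a fixed step size.
x = np.array([5.0, -3.0])
for t in range(10000):
    x = x - 0.1 * noisy_grad(x)  # fixed lambda_t = 0.1

# x reaches the neighborhood of the minimum quickly, then hovers at
# a distance set by the step size and the variance of noisy_grad.
print(np.linalg.norm(x))
```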

More recently, Polyak and Juditsky worked on “iterate averaging”. The idea is to let the step sizes \lambda_t go to zero more slowly than O(1/t), but to take as the solution the average of the {\bf x}_t over some number of previous steps (a number that increases with t). This can lead to faster convergence.
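Continuing the same illustrative sketch, here is one simple form of iterate averaging; the t^{-2/3} schedule and the choice to average over all iterates so far are my own choices, not taken from Polyak and Juditsky’s paper:

```python
# Continuing the toy example: a step size decaying slower than O(1/t),
# with the running average of the iterates taken as the solution.
x = np.array([5.0, -3.0])
x_sum = np.zeros_like(x)
T = 10000
for t in range(1, T + 1):
    lam = 1.0 / t ** (2.0 / 3.0)   # decays more slowly than O(1/t)
    x = x - lam * noisy_grad(x)
    x_sum += x

x_avg = x_sum / T                  # the averaged iterate is the answer
print(np.linalg.norm(x), np.linalg.norm(x_avg))  # average typically closer to 0
```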

Kiefer-Wolfowitz

In a Kiefer-Wolfowitz problem, one does not have access to noisy estimates of \nabla f, but only to noisy evaluations of f itself. In their 1952 paper, Kiefer and Wolfowitz suggested, essentially, estimating the gradient by a finite-difference scheme.

g_i = \frac{ f({\bf x} + c {\bf e_i})-f({\bf x} - c {\bf e_i})}{2 c}

There are two major problems with this. First, to estimate one gradient, we need 2N evaluations of f, where N is the number of dimensions of \bf x. This could be very expensive. Second, for any given c, the above is a biased estimate of the true gradient. One can reduce this bias arbitrarily by making c very small, but this causes a catastrophic variance explosion: for small c, the estimated gradient is dominated by the random fluctuations in f, rather than by the change in f due to the change in \bf x that we are actually interested in. (The noise in the estimate is of order O(1/c): if each evaluation of f carries independent noise of variance \sigma^2, the noise in each component of the estimated gradient has variance \sigma^2/(2c^2).) This variance can be reduced, either again by techniques like common random numbers, or by, e.g., estimating many gradients and averaging them. Basically, for K-W to be practical, one should have a problem with few dimensions, and a function f that is cheap to evaluate and/or has low variance.

1. For t=1,2,...

2. Estimate the gradient by finite differences (noisy, and biased for any fixed c_t): \forall i, {g}_{t,i} = \frac{ f({\bf x}_t + c_t {\bf e}_i)-f({\bf x}_t - c_t {\bf e}_i)}{2 c_t}

3. Step along the estimated gradient: {\bf x}_{t+1} = {\bf x}_t - \lambda_t {\bf g}_t

The Kiefer-Wolfowitz algorithm is guaranteed to asymptotically converge to a minimum of f if the sequences c_t and \lambda_t both go to zero slowly enough. In practice, however, because of the “variance explosion” above, one almost always just picks a fixed constant c. This keeps the variance bounded, at the cost of somewhat biasing the final solution. Interestingly, Flaxman et al. showed that this amounts to minimizing a “smoothed” version of f, where the amount of smoothing depends on c.
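Here is a minimal Kiefer-Wolfowitz sketch in the same illustrative setting (reusing numpy and rng from above); now only noisy evaluations of f are available, and the fixed c and the step schedule are again my own choices:

```python
# Continuing the toy example: only noisy function values are available.
def noisy_f(x):
    """Noisy evaluation of f(x) = 0.5 * ||x||^2."""
    return 0.5 * np.dot(x, x) + rng.normal(scale=1.0)

def fd_grad(x, c):
    """Finite-difference gradient estimate: costs 2N evaluations of f."""
    g = np.empty_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = 1.0
        g[i] = (noisy_f(x + c * e) - noisy_f(x - c * e)) / (2.0 * c)
    return g

x = np.array([5.0, -3.0])
c = 0.5                          # fixed c: bounded variance, some bias
for t in range(1, 10001):
    lam = 1.0 / (10.0 + t)
    x = x - lam * fd_grad(x, c)

print(x)                         # near the minimum of a smoothed f
```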

To come:

1. SPSA

2. Infinitesimal Perturbation Analysis

3. The Likelihood Ratio method.

4. Common Random Numbers, and Variance Reduction Methods more generally

5. LeCun et al.’s heuristics for training neural networks as variance reduction methods.

A standard reference on stochastic approximation is “Stochastic Approximation Algorithms and Applications” by Kushner and Yin. Spall’s “Introduction to Stochastic Search and Optimization” is more applied and more accessible.

Recommended:
“On the choice of random directions for stochastic approximation algorithms” by James Theiler and Jarod Alper

Some papers on optimization by random directions:
“Learning curves for stochastic gradient descent in linear feedforward networks” by Justin Werfel and Xiaohui Xie

“A Fast Stochastic Error-Descent Algorithm for Supervised Learning and Optimization” by Gert Cauwenberghs
