A classic implementation issue in machine learning is reducing the cost of computing the sigmoid function

.

Specifically, it is common to profile your code and discover that 90% of the time is spent computing the in that function. This comes up often in neural networks, as well as in various probabilistic architectures, such as sampling from Ising models or Boltzmann machines. There are quite a few classic approximations to the function, using simple polynomials, etc. that can be used in neural networks.

Today, however, I was faced with a sampling problem involving the repeated use of the sigmoid function, and I noticed a simple trick that could reduce the number of sigmoids by about 88% without introducing any approximation. The particular details of the situation aren’t interesting, but I repeatedly needed to do something like the following:

- Input
- Compute a random number
- If
- Output
- Else
- Output

Now, let’s assume for simplicity that is positive. (Otherwise, sample using and then switch the sign of the output.) There are two observations to make:

- If is large, then you are likely to output
- Otherwise, there are easy upper and lower bounds on the probability of outputting

This leads to the following algorithm:

- Input
- Compute a random number
- If
- If or
- Output
- Else
- Output
- Else
- If
- Output
- Else if or
- Output
- Else
- Output

The idea is as follows:

- If , then we can lower-bound the probability of outputting +1 by a pre-computed value of , and short-circuit the computation in many cases.
- If , then we can upper bound the sigmoid function by .
- If , then we can also lower bound by respectively. (This constant was found numerically).

The three cases are illustrated in the following figure, where the input is on the x-axis, and the random number is on the y-axis.

Since, for all at least a fraction of the numbers will be short-circuited, sigmoid calls will be reduced appropriately. If is often near zero, you will do even better.

Obviously, you can take this farther by adding more cases, which may or may not be helpful, depending on the architecture, and the cost of branching vs. the cost of computing an .