## Reducing Sigmoid computations by (at least) 88.0797077977882%

A classic implementation issue in machine learning is reducing the cost of computing the sigmoid function

$\sigma(a) = \frac{1}{1+\exp(-a)}$.

Specifically, it is common to profile your code and discover that 90% of the time is spent computing the $\exp$ in that function.  This comes up often in neural networks, as well as in various probabilistic architectures, such as sampling from Ising models or Boltzmann machines.  There are quite a few classic approximations to the function, using simple polynomials, etc. that can be used in neural networks.

Today, however, I was faced with a sampling problem involving the repeated use of the sigmoid function, and I noticed a simple trick that could reduce the number of sigmoids by about 88% without introducing any approximation.  The particular details of the situation aren’t interesting, but I repeatedly needed to do something like the following:

1. Input $a \in \Re$
2. Compute a random number $r \in [0,1]$
3. If $r < \sigma(a)$
4.   Output $+1$
5. Else
6.     Output $-1$

Now, let’s assume for simplicity that $a$ is positive.  (Otherwise, sample using $-a$ and then switch the sign of the output.)  There are two observations to make:

1. If $a$ is large, then you are likely to output $+1$
2. Otherwise, there are easy upper and lower bounds on the probability of outputting $+1$

This leads to the following algorithm:

1. Input $a \in \Re$
2. Compute a random number $r \in [0,1]$
3. If $a \geq 2$
4.     If $r \leq 0.880797077977882$ or $r \leq \sigma(a)$
5.         Output $+1$
6.     Else
7.         Output $-1$
8. Else
9.     If $r > .5 + a/4$
10.         Output $-1$
11.     Else if $r \leq .5 + a/5.252141128658$ or $r \leq \sigma(a)$
12.         Output $+1$
13.     Else
14.         Output $-1$

The idea is as follows:

1. If $a\geq 2$, then we can lower-bound the probability of outputting +1 by a pre-computed value of $\sigma(2)\approx0.8807...$, and short-circuit the computation in many cases.
2. If $a\leq 2$, then we can upper bound the sigmoid function by $.5+a/4$.
3. If $a\leq 2$, then we can also lower bound by $.5+a/5.252141...$ respectively.  (This constant was found numerically).

The three cases are illustrated in the following figure, where the input $a$ is on the x-axis, and the random number $r$ is on the y-axis.

Since, for all $a$ at least a fraction $\sigma(2)\approx.8807$ of the numbers will be short-circuited, sigmoid calls will be reduced appropriately.  If $a$ is often near zero, you will do even better.

Obviously, you can take this farther by adding more cases, which may or may not be helpful, depending on the architecture, and the cost of branching vs. the cost of computing an $\exp$.

Posted in Uncategorized | | 9 Comments

## Windows Binaries for CRF toolbox

I’m happy to report that, thanks to Alexei Skurikhin, Windows binaries are now available for all the functions in my CRF toolbox.  Hopefully this makes for an easier out of the box experience from Windows folks who, based on my email, have struggled to compile things in the past.

## Etymology of Latin words/abbreviations

I often find the academic convention of using Latin words and abbreviations a bit silly.  (Though I’m as guilty as anyone.) Today, I decided to look into a couple common ones, to see if there is any good reason not to use the ordinary English equivalent.

• “e.g.” is an abbreviation of the Latin “exemplī grātiā”, meaning “for example”.  Here, “exemplī” is not hard to guess and “grātiā” means “for the sake of”.
• “i.e.” is an abbreviation of the Latin “id est”, meaning “that is”.
• “via”.  This is not an abbreviation, but was the Latin word for a road.  The plural is “viae”, which you should probably work into your next paper submission.  (I think the usage would be “Linear convergence and entropic sparsity control viae quasi-stochasticity and entropy regulators”?)
• “et al.”.  Here “et” is Latin for “and” while “al.” is an abreviation for “alii”, meaning “others”.  I seem to recall a major statistics journal maintaining the noble tradition of actually writing “and others”, but can’t recall which one.

I see no particular reason we haven’t adopted English words for all these conventions, but in the context of the many local optima society is in… this hardly seems worth changing.

## Julia, Matlab, and C

Julia is a new language in the same arena as Matlab or R. I’ve had failed attempts to quit the Matlab addiction in the past, making me generally quite conservative about new platforms. However, I’ve recently been particularly annoyed by Matlab’s slow speed, evil license manager errors, restrictions on parallel processes, C++ .mex file pain, etc., and so I decided to check it out. It seems inevitable that Matlab will eventually displaced by something. The question is: is that something Julia?

“We want a language that’s open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.”

Essentially, the goal seems to be a faster, freer Matlab that treats users like adults (macros!) and doesn’t require writing any .mex files in C++ or Fortan. Sounds too good to be true? I decided to try it out. My comparisons are to Matlab (R2012a) and C (gcc 4.2 with -O2 and -fno-builtin to prevent compile-time computations) all on a recent MacBook Pro (2.3 GHz Intel Core i7 w/ 8GB RAM).

Installation was trivial: I just grabbed a pre-compiled binary and started it up.

I deliberately used naive algorithms, since I am just testing raw speed. It should be a fair comparison, as long as the algorithm is constant. Please let me know about any bugs, though.

## First benchmark: Fibonnaci

% Matlab
function f=fib(n)
if n <= 2
f=1.0;
else
f=fib(n-1)+fib(n-2);
end
end

// C
double fib(int n){
if(n<=2)
return(1.0);
else
return(fib(n-2)+fib(n-1));
}

% julia
function fib(n)
if n <= 2
1.0
else
fib(n-1)+fib(n-2);
end
end


Clarity is basically a tie. Running them for n=30 we get:

time in matlab (fib): 14.344231
time in c (fib):       0.005887
time in julia (fib):   0.237832

## Second benchmark: Matrix Multiplication

This is a test of the naive O(N^3) matrix multiplication algorithm.

% matlab
function C=mmult(A,B,C)
[M,N] = size(A);
for i=1:M
for j=1:M
for k=1:M
C(i,j) = C(i,j) + A(i,k)*B(k,j);
end
end
end
end

// C
#define M 500
void mmult(double A[M][M],double B[M][M], double C[M][M]){
//double C[M][M];
int i,j,k;
for(i=0; i<M; i++)
for(j=0; j<M; j++){
C[i][j] = 0;
for(k=0; k<M; k++)
C[i][j] += A[i][k]*B[k][j];
}
}

# julia
function mmult(A,B)
(M,N) = size(A);
C = zeros(M,M);
for i=1:M
for j=1:M
for k=1:M
C[i,j] += A[i,k]*B[k,j];
end
end
end
C;
end


Here, I think that Matlab and Julia and a bit clearer, and Julia wins though the wonders of having “+=”. The timing results on 500×500 matrices are:

time in matlab (matmult): 1.229571
time in c (matmult):      0.157658
time in julia (matmult):  0.5029549

## Third Benchmark: numerical quadrature

Here, we attempt to calculate the integral $\int_{x=5}^{15} sin(x) dx$ by numerical quadrature, using a simple midpoint rule with computations at $10^7$ points.

% matlab
val = 0.0;
for x=lb:(ub-lb)/npoints:ub
val = val + sin(x)/npoints;
end
end

// C
double numquad(double lb,double ub,int npoints){
double val = 0.0;
int i;
for(i=0; i<=npoints; i++){
double x = lb + (ub-lb)*i/npoints;
val += sin(x)/npoints;
}
return(val);
}

# julia
val = 0.0
for x=lb:(ub-lb)/npoints:ub
val += sin(x)/npoints
end
val
end

The timings are:

time in matlab (numquad): 0.446151
time in c (numquad):      0.167112
time in julia (numquad):  0.256597

## Fourth Benchmark: Belief Propagation

Finally, I decided to try a little algorithm similar to what I actually tend to implement for my research. Roughly speaking, Belief Propagation is a repeated sequence of matrix multiplications, followed by normalization.

% matlab
function x=beliefprop(A,x,N)
for i=1:N
x = A*x;
x = x/sum(x);
end
end

// C
void beliefprop(double A[25][25], double x[25], int N){
int i, n, j;
double x2[25];
for(n=0; n<N; n++){
for(i=0; i<25; i++){
x2[i]=0;
for(j=0; j<25; j++)
x2[i] += A[i][j]*x[j];
}
for(i=0; i<25; i++)
x[i]=x2[i];
double mysum = 0;
for(i=0; i<25; i++)
mysum += x[i];
for(i=0; i<25; i++)
x[i] /= mysum;
}
return;
}

% julia
function beliefprop(A,x,N)
for i=1:N
x = A*x;
x /= sum(x);
end
x
end

Here, I think we can agree that Matlab and Julia are clearer. (Please don’t make fun of me for hardcoding the 25 dimensions in C.) Using a matrix package for C would probably improve clarity, but perhaps also slow things down. The results are:

time in matlab (beliefprop): 0.627478
time in c (beliefprop):      0.074355
time in julia (beliefprop):  0.376427

## Fifth Benchmark: BP in log-space

In practice, Belief Propagation is often implemented in log-space (to help avoid numerical under/over-flow.). To simulate an algorithm like this, I tried changing to propagation to take an exponent before multiplication, and a logarithm before storage.

% matlab
function x=beliefprop2(A,x,N)
for i=1:N
x = log(A*exp(x));
x = x - log(sum(exp(x)));
end
end

// C
void beliefprop2(double A[25][25], double x[25], int N){
int i, n, j;
double x2[25];
for(n=0; n<N; n++){
for(i=0; i<25; i++){
x2[i]=0;
for(j=0; j<25; j++)
x2[i] += A[i][j]*exp(x[j]);
}
for(i=0; i<25; i++)
x[i]=log(x2[i]);
double mysum = 0;
for(i=0; i<25; i++)
mysum += exp(x[i]);
double mynorm = log(mysum);
for(i=0; i<25; i++)
x[i] -= mynorm;
}
return;
}

# julia
function beliefprop2(A,x,N)
for i=1:N
x = log(A*exp(x));
x -= log(sum(exp(x)));
end
x
end

Life is too short to write C code like that when not necessary. But how about the speed, you ask?

time in matlab (beliefprop2): 0.662761
time in c (beliefprop2):      0.657620
time in julia (beliefprop2):  0.530220

## Sixth Benchmark: Markov Chain Monte Carlo

Here, I implement a simple Metropolis algorithm. For no particular reason, I use the two-dimensional distribution:

$p(x) \propto \exp(\sin(5 x_1) - x_1^2 - x_2^2)$

% matlab
function mcmc(x,N)
f = @(x) exp(sin(x(1)*5) - x(1)^2 - x(2)^2);
p = f(x);
for n=1:N
x2 = x + .01*randn(size(x));
p2 = f(x2);
if rand < p2/p
x = x2;
p = p2;
end
end
end

// C
double f(double *x){
return exp(sin(x[0]*5) - x[0]*x[0] - x[1]*x[1]);
}
#define pi 3.141592653589793
void mcmc(double *x,int N){
double p = f(x);
int n;
double x2[2];
for(n=0; n<N; n++){
// run Box_Muller to get 2 normal random variables
double U1 = ((double)rand())/RAND_MAX;
double U2 = ((double)rand())/RAND_MAX;
double R1 = sqrt(-2*log(U1))*cos(2*pi*U2);
double R2 = sqrt(-2*log(U1))*sin(2*pi*U2);
x2[0] = x[0] + .01*R1;
x2[1] = x[1] + .01*R2;
double p2 = f(x2);
if(((double)rand())/RAND_MAX< p2/p){
x[0] = x2[0];
x[1] = x2[1];
p = p2;
}
}
}

% julia
function mcmc(x,N)
f(x) = exp(sin(x[1]*5) - x[1]^2 - x[2]^2);
p = f(x);
for n=1:N
x2 = x + .01*randn(size(x));
p2 = f(x2);
if rand() < p2/p
x = x2;
p = p2;
end
end
end


Again, I think that C is far less clear than Matlab or Julia. The timings are:

time in matlab (mcmc): 7.747716
time in c (mcmc):      0.150776
time in julia (mcmc):  0.479628

## Table

All times are in seconds. (Lower is better.)

           Matlab     C     Julia
fib        14.344   0.005   0.237
matmult     1.229   0.157   0.502
numquad     0.446   0.167   0.256
bp          0.627   0.074   0.376
bp2         0.662   0.657   0.530
mcmc        7.747   0.150   0.479

## Conclusions

I’m sure all these programs can be sped up. In particular, I’d bet that an expert could optimize the C code to beat Julia on bp2 and mcmc. These are a test of “how fast can Justin Domke make these programs”, not the intrinsic capabilities of the languages. That said, Julia allows for optional type declarations. I did experiment with these but found absolutely no speed improvement. (Which is a good or a bad thing, depending on how you look at life.)

Another surprise to me was how often Matlab’s JIT managed a speed within a reasonable factor of C. (Except when it didn’t…)

The main thing that at Matlab programmer will miss in Julia is undoubtedly plotting. The Julia designers seem to understand the importance of this (“non-negotiable”). If Julia equalled Matlab’s plotting facilities, Matlab would be in real trouble!

Overall, I think that the killer features of freedom, kinda-sorta-C-like speed, and ease of use make Julia more likely as a Matlab-killer than other projects such as R, Sage, Octave, Scipy, etc. (Not to say that those projects have not succeeded in other ways!) Though Julia’s designers also seem to be targeting current R users, my guess is that they will have more success with Matlab folks in the short term, since most Matlab functionality (other than plotting) already exists, while reproducing R’s statistical libraries will be quite difficult. I also think that Julia would be very attractive to current users of languages like Lush. Just to never write another .mex file, I’ll very seriously consider Julia for new projects. Other benefits such as macros, better parallelism support are just bonuses. As Julia continues to develop, it will become yet more attractive.

There was an interesting discussion on Lambda the Ultimate about Julia back when it was announced

Posted in Uncategorized | Tagged , , , | 34 Comments

## CRF Toolbox Updated

I updated the code for my Graphical Models / Conditional Random Fields toolbox This is a Matlab toolbox, though almost all the real work is done in compiled C++ for efficiency. The main improvements are:

• Lots of bugfixes.
• Various small improvements in speed.
• A unified CRF training interface to make things easier for those not training on images
• Binaries are now provided for Linux as well as OS X.
• The code for inference and learning using TRW is now multithreaded, using openmp.
• Switched to using a newer version of Eigen

There is also far more detailed examples, including a full tutorial of how to train a CRF to do “semantic segmentation” on the Stanford Backgrounds dataset. Just using simple color, position, and Histogram of Gradient features, the error rates are 23%, which appear to be state of the art (and better than previous CRF based approaches.) It takes about 90 minutes to train on my 8-core machine, and processes new frames in a little over a second each.

For fun, I also ran this model on a video of someone driving from Alexandria into Georgetown. You can see that the results are far from perfect but are reasonably good. (Notice it successfully distinguishes trees and grass at 0:12)

I’m keen to have others use the code (what with the hundreds of hours spent writing it), so please send email if you have any issues.

Posted in Uncategorized | | 69 Comments

## Personal opinions about graphical models 1: The surrogate likelihood exists and you should use it.

When talking about graphical models with people  (particularly computer vision folks) I find myself advancing a few opinions over and over again.  So, in an effort to stop bothering people at conferences, I thought I’d write a few entries here.

The first thing I’d like to discuss is “surrogate likelihood” training.  (So far as I know, Martin Wainwright was the first person to give a name to this method.)

### Background

Suppose we want to fit a Markov random field (MRF).  I’m writing this as a generative model with an MRF for simplicity– pretty much the same story holds with a Conditional Random Field in the discriminative setting.

$p({\bf x}) = \frac{1}{Z} \prod_{c} \psi({\bf x}_c) \prod_i \psi(y_i)$

Here, the first product is over all cliques/factors in the graph, and the second is over all single variables.  Now, it is convenient to note that MRFs can be seen as members of the exponential family

$p({\bf x};{\boldsymbol \theta}) = \exp( {\boldsymbol \theta} \cdot {\bf f}({\bf x}) - A({\boldsymbol \theta}) )$,

where

${\bf f}({\bf X})=\{I[{\bf X}_{c}={\bf x}_{c}]|\forall c,{\bf x}_{c}\}\cup\{I[X_{i}=x_{i}]|\forall i,x_{i} \}$

is a function consisting of indicator functions for each possible configuration of each clique and variable, and the log-partition function

$A(\boldsymbol{\theta})=\log\sum_{{\bf x}}\exp\boldsymbol{\theta}\cdot{\bf f}({\bf x})$.

ensures normalization.

Now, the log-partition function has the very important (and easy to show) property that the gradient is the expected value of $\bf f$.

$\displaystyle \frac{dA}{d{\boldsymbol \theta}} = \sum_{\bf x} p({\bf x};{\boldsymbol \theta}) {\bf f}({\bf x})$

With a graphical model, what does this mean?  Well, notice that the expected value of, say, $I[X_i=x_i]$ will be exactly $p(x_i;{\boldsymbol \theta})$. Thus, the expected value of ${\bf f}$ will be a vector containing all univariate and clique-wise marginals.  If we write this as ${\boldsymbol \mu}({\boldsymbol \theta})$, then we have

$\displaystyle \frac{dA}{d{\boldsymbol \theta}} = {\boldsymbol \mu}({\boldsymbol \theta})$.

### The usual story

Suppose we want to do maximum likelihood learning.  This means we want to set ${\boldsymbol \theta}$ to maximize

$L( {\boldsymbol \theta} ) = \frac{1}{N}\sum_{\hat{{\bf x}}}\log p({\bf x};{\boldsymbol \theta})={\boldsymbol \theta}\cdot\frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}})-A({\boldsymbol \theta}).$

If we want to use gradient ascent, we would just take a small step along the gradient.  This has a very intuitive form: it is the difference of the expected value of $\bf f$ under the model to the expected value of $\bf f$ under the current distribution.

$\displaystyle \frac{dL}{d{\boldsymbol \theta}} = \frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}}) - \sum_{\bf x} p({\bf x};{\boldsymbol \theta}) {\bf f}({\bf x})$.

$\displaystyle \frac{dL}{d{\boldsymbol \theta}} = \frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}}) - {\boldsymbol \mu}({\boldsymbol \theta})$.

Note the lovely property of moment matching here. If we have found a solution, then $dL/d{\boldsymbol \theta}=0$ and so the expected value of $\bf f$ under the current distribution will be exactly equal to that under the data.

Unfortunately, in a high-treewidth setting, we can’t compute the marginals.  That’s too bad.  However, we have all these lovely approximate inference algorithms (loopy belief propagation, tree-reweighted belief propagation, mean field, etc.).  Suppose we write the resulting approximate marginals as $\tilde{{\boldsymbol \mu}}({\boldsymbol \theta})$.  Then, instead of taking the above gradient step, why not instead just use

$\frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}}) - \tilde{{\boldsymbol \mu}}({\boldsymbol \theta})$?

That’s all fine!  However, I often see people say/imply/write some or all of the following:

1. This is not guaranteed to converge.
2. There is no longer any well-defined objective function being maximized.
3. We can’t use line searches.
4. We have to use (possibly stochastic) gradient ascent.
5. This whole procedure is frightening and shouldn’t be mentioned in polite company.

I agree that we should view this procedure with some suspicion, but it gets far more than it deserves! The first four points, in my view, are simply wrong.

### What’s missing

The critical thing that is missing from the above story is this: Approximate marginals come together with an approximate partition function!

That is, if you are computing approximate marginals $\tilde{{\boldsymbol \mu}}({\boldsymbol \theta})$ using loopy belief propagation, mean-field, or tree-reweighted belief propagation, there is a well-defined approximate log-partition function $\tilde{A}({\boldsymbol \theta})$ such that

$\displaystyle \tilde{{\boldsymbol \mu}}({\boldsymbol \theta}) = \frac{d\tilde{A}}{d{\boldsymbol \theta}}$.

What this means is that you should think, not of approximating the likelihood gradient, but of approximating the likelihood itself. Specifically, what the above is really doing is optimizing the “surrogate likelihood”

$\tilde{L}({\boldsymbol \theta}) = {\boldsymbol \theta}\cdot\frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}})-\tilde{A}({\boldsymbol \theta}).$

What’s the gradient of this? It is

$\frac{1}{N}\sum_{\hat{{\bf x}}}{\bf f}(\hat{{\bf x}}) - \tilde{{\boldsymbol \mu}}({\boldsymbol \theta}),$

or exactly the gradient that was being used above. The advantage of doing things this way is that it is a normal optimization.  There is a well-defined objective. It can be plugged into a standard optimization routine, such as BFGS, which will probably be faster than gradient ascent.  Line searches guarantee convergence. $\tilde{A}$ is perfectly tractable to compute. In fact, if you have already computed approximate marginals, $\tilde{A}$ has almost no cost. Life is good.

The only counterargument I can think of is that mean-field and loopy BP can have different local optima, which might mean that a no-line-search-refuse-to-look-at-the-objective-function-just-follow-the-gradient-and-pray style optimization could be more robust, though I’d like to see that argument made…

I’m not sure of the history, but I think part of the reason this procedure has such a bad reputation (even from people that use it!) might be that it predates the “modern” understanding of inference procedures as producing approximate partition functions as well as approximate marginals.

Posted in Uncategorized | | 7 Comments

## Functions

I was pleased the other day to have cause to discover that this is valid matlab syntax:

make_energy_fun = @(x,f) @(y,w) energy(y,x,f,w);

My thoughts:

• This is great!  I’m creating an anonymous function which takes two parameters and returns an anonymous function taking two parameters which evaluates the energy with the original two parameters baked in. Then the (original) anonymous function is assigned to the variable make_energy_fun.
• But… in a civilized programming language wouldn’t this be, literally, an easy problem on a freshman problem set?
• …Why is it that we don’t use civilized programming languages, again?
Posted in Uncategorized | 2 Comments