Julia is a new language in the same arena as Matlab or R. I’ve had failed attempts to quit the Matlab addiction in the past, making me generally quite conservative about new platforms. However, I’ve recently been particularly annoyed by Matlab’s slow speed, evil license manager errors, restrictions on parallel processes, C++ .mex file pain, etc., and so I decided to check it out. It seems inevitable that Matlab will eventually be displaced by something. The question is: is that something Julia?
“We want a language that’s open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.”
Essentially, the goal seems to be a faster, freer Matlab that treats users like adults (macros!) and doesn’t require writing any .mex files in C++ or Fortran. Sounds too good to be true? I decided to try it out. My comparisons are to Matlab (R2012a) and C (gcc 4.2 with -O2 and -fno-builtin to prevent compile-time computations), all on a recent MacBook Pro (2.3 GHz Intel Core i7 w/ 8GB RAM).
Installation was trivial: I just grabbed a precompiled binary and started it up.
I deliberately used naive algorithms, since I am just testing raw speed. It should be a fair comparison, as long as the algorithm is constant. Please let me know about any bugs, though.
First benchmark: Fibonacci
% Matlab
function f=fib(n)
if n <= 2
f=1.0;
else
f=fib(n-1)+fib(n-2);
end
end
// C
double fib(int n){
if(n<=2)
return(1.0);
else
return(fib(n-2)+fib(n-1));
}
# julia
function fib(n)
if n <= 2
1.0
else
fib(n-1)+fib(n-2);
end
end
Clarity is basically a tie. Running them for n=30 we get:
time in matlab (fib): 14.344231
time in c (fib): 0.005887
time in julia (fib): 0.237832
Second benchmark: Matrix Multiplication
This is a test of the naive O(N^3) matrix multiplication algorithm.
% matlab
function C=mmult(A,B,C) % note: C is passed in pre-initialized to zeros
[M,N] = size(A);
for i=1:M
for j=1:M
for k=1:M
C(i,j) = C(i,j) + A(i,k)*B(k,j);
end
end
end
end
// C
#define M 500
void mmult(double A[M][M],double B[M][M], double C[M][M]){
//double C[M][M];
int i,j,k;
for(i=0; i<M; i++)
for(j=0; j<M; j++){
C[i][j] = 0;
for(k=0; k<M; k++)
C[i][j] += A[i][k]*B[k][j];
}
}
# julia
function mmult(A,B)
(M,N) = size(A);
C = zeros(M,M);
for i=1:M
for j=1:M
for k=1:M
C[i,j] += A[i,k]*B[k,j];
end
end
end
C;
end
Here, I think that Matlab and Julia are a bit clearer, and Julia wins through the wonders of having “+=”. The timing results on 500×500 matrices are:
time in matlab (matmult): 1.229571
time in c (matmult): 0.157658
time in julia (matmult): 0.5029549
Third Benchmark: numerical quadrature
Here, we attempt to calculate the integral of sin(x) from lb to ub by numerical quadrature, using a simple midpoint rule with computations at npoints points.
% matlab
function val=numquad(lb,ub,npoints)
val = 0.0;
for x=lb:(ub-lb)/npoints:ub
val = val + sin(x)/npoints;
end
end
// C
double numquad(double lb,double ub,int npoints){
double val = 0.0;
int i;
for(i=0; i<=npoints; i++){
double x = lb + (ub-lb)*i/npoints;
val += sin(x)/npoints;
}
return(val);
}
# julia
function numquad(lb,ub,npoints)
val = 0.0
for x=lb:(ub-lb)/npoints:ub
val += sin(x)/npoints
end
val
end
The timings are:
time in matlab (numquad): 0.446151
time in c (numquad): 0.167112
time in julia (numquad): 0.256597
Fourth Benchmark: Belief Propagation
Finally, I decided to try a little algorithm similar to what I actually tend to implement for my research. Roughly speaking, Belief Propagation is a repeated sequence of matrix multiplications, followed by normalization.
% matlab
function x=beliefprop(A,x,N)
for i=1:N
x = A*x;
x = x/sum(x);
end
end
// C
void beliefprop(double A[25][25], double x[25], int N){
int i, n, j;
double x2[25];
for(n=0; n<N; n++){
for(i=0; i<25; i++){
x2[i]=0;
for(j=0; j<25; j++)
x2[i] += A[i][j]*x[j];
}
for(i=0; i<25; i++)
x[i]=x2[i];
double mysum = 0;
for(i=0; i<25; i++)
mysum += x[i];
for(i=0; i<25; i++)
x[i] /= mysum;
}
return;
}
# julia
function beliefprop(A,x,N)
for i=1:N
x = A*x;
x /= sum(x);
end
x
end
Here, I think we can agree that Matlab and Julia are clearer. (Please don’t make fun of me for hardcoding the 25 dimensions in C.) Using a matrix package for C would probably improve clarity, but perhaps also slow things down. The results are:
time in matlab (beliefprop): 0.627478
time in c (beliefprop): 0.074355
time in julia (beliefprop): 0.376427
Fifth Benchmark: BP in logspace
In practice, Belief Propagation is often implemented in log-space (to help avoid numerical under/overflow). To simulate an algorithm like this, I tried changing the propagation to take an exponent before multiplication, and a logarithm before storage.
% matlab
function x=beliefprop2(A,x,N)
for i=1:N
x = log(A*exp(x));
x = x - log(sum(exp(x)));
end
end
// C
void beliefprop2(double A[25][25], double x[25], int N){
int i, n, j;
double x2[25];
for(n=0; n<N; n++){
for(i=0; i<25; i++){
x2[i]=0;
for(j=0; j<25; j++)
x2[i] += A[i][j]*exp(x[j]);
}
for(i=0; i<25; i++)
x[i]=log(x2[i]);
double mysum = 0;
for(i=0; i<25; i++)
mysum += exp(x[i]);
double mynorm = log(mysum);
for(i=0; i<25; i++)
x[i] -= mynorm;
}
return;
}
# julia
function beliefprop2(A,x,N)
for i=1:N
x = log(A*exp(x));
x -= log(sum(exp(x)));
end
x
end
Life is too short to write C code like that when not necessary. But how about the speed, you ask?
time in matlab (beliefprop2): 0.662761
time in c (beliefprop2): 0.657620
time in julia (beliefprop2): 0.530220
Sixth Benchmark: Markov Chain Monte Carlo
Here, I implement a simple Metropolis algorithm. For no particular reason, I use the two-dimensional distribution p(x) ∝ exp(sin(5 x_1) - x_1^2 - x_2^2):
% matlab
function mcmc(x,N)
f = @(x) exp(sin(x(1)*5) - x(1)^2 - x(2)^2);
p = f(x);
for n=1:N
x2 = x + .01*randn(size(x));
p2 = f(x2);
if rand < p2/p
x = x2;
p = p2;
end
end
end
// C
double f(double *x){
return exp(sin(x[0]*5) - x[0]*x[0] - x[1]*x[1]);
}
#define pi 3.141592653589793
void mcmc(double *x,int N){
double p = f(x);
int n;
double x2[2];
for(n=0; n<N; n++){
// run Box-Muller to get 2 normal random variables
double U1 = ((double)rand())/RAND_MAX;
double U2 = ((double)rand())/RAND_MAX;
double R1 = sqrt(-2*log(U1))*cos(2*pi*U2);
double R2 = sqrt(-2*log(U1))*sin(2*pi*U2);
x2[0] = x[0] + .01*R1;
x2[1] = x[1] + .01*R2;
double p2 = f(x2);
if(((double)rand())/RAND_MAX < p2/p){
x[0] = x2[0];
x[1] = x2[1];
p = p2;
}
}
}
# julia
function mcmc(x,N)
f(x) = exp(sin(x[1]*5) - x[1]^2 - x[2]^2);
p = f(x);
for n=1:N
x2 = x + .01*randn(size(x));
p2 = f(x2);
if rand() < p2/p
x = x2;
p = p2;
end
end
end
Again, I think that C is far less clear than Matlab or Julia. The timings are:
time in matlab (mcmc): 7.747716
time in c (mcmc): 0.150776
time in julia (mcmc): 0.479628
Table
All times are in seconds. (Lower is better.)
          Matlab    C       Julia
fib       14.344    0.005   0.237
matmult    1.229    0.157   0.502
numquad    0.446    0.167   0.256
bp         0.627    0.074   0.376
bp2        0.662    0.657   0.530
mcmc       7.747    0.150   0.479
Conclusions
I’m sure all these programs can be sped up. In particular, I’d bet that an expert could optimize the C code to beat Julia on bp2 and mcmc. These are a test of “how fast can Justin Domke make these programs”, not the intrinsic capabilities of the languages. That said, Julia allows for optional type declarations. I did experiment with these but found absolutely no speed improvement. (Which is a good or a bad thing, depending on how you look at life.)
Another surprise to me was how often Matlab’s JIT managed a speed within a reasonable factor of C. (Except when it didn’t…)
The main thing that a Matlab programmer will miss in Julia is undoubtedly plotting. The Julia designers seem to understand the importance of this (“non-negotiable”). If Julia equalled Matlab’s plotting facilities, Matlab would be in real trouble!
Overall, I think that the killer features of freedom, kinda-sorta-C-like speed, and ease of use make Julia a more likely Matlab-killer than other projects such as R, Sage, Octave, Scipy, etc. (Not to say that those projects have not succeeded in other ways!) Though Julia’s designers also seem to be targeting current R users, my guess is that they will have more success with Matlab folks in the short term, since most Matlab functionality (other than plotting) already exists, while reproducing R’s statistical libraries will be quite difficult. I also think that Julia would be very attractive to current users of languages like Lush. Just to never write another .mex file, I’ll very seriously consider Julia for new projects. Other benefits, such as macros and better parallelism support, are just bonuses. As Julia continues to develop, it will become yet more attractive.
There was an interesting discussion on Lambda the Ultimate about Julia back when it was announced.
When talking about graphical models with people (particularly computer vision folks) I find myself advancing a few opinions over and over again. So, in an effort to stop bothering people at conferences, I thought I’d write a few entries here.
The first thing I’d like to discuss is “surrogate likelihood” training. (So far as I know, Martin Wainwright was the first person to give a name to this method.)
Background
Suppose we want to fit a Markov random field (MRF). I’m writing this as a generative model with an MRF for simplicity; pretty much the same story holds with a Conditional Random Field in the discriminative setting.

$p(\mathbf{x}) = \frac{1}{Z} \prod_{c} \psi_c(\mathbf{x}_c) \prod_{i} \psi_i(x_i)$

Here, the first product is over all cliques/factors in the graph, and the second is over all single variables. Now, it is convenient to note that MRFs can be seen as members of the exponential family

$p(\mathbf{x};\boldsymbol{\theta}) = \exp\bigl( \boldsymbol{\theta} \cdot \mathbf{f}(\mathbf{x}) - A(\boldsymbol{\theta}) \bigr)$,

where $\mathbf{f}(\mathbf{x})$ is a function consisting of indicator functions for each possible configuration of each clique and variable, and the log-partition function

$A(\boldsymbol{\theta}) = \log \sum_{\mathbf{x}} \exp\, \boldsymbol{\theta} \cdot \mathbf{f}(\mathbf{x})$

ensures normalization.

Now, the log-partition function has the very important (and easy to show) property that the gradient is the expected value of $\mathbf{f}$,

$\frac{dA}{d\boldsymbol{\theta}} = \sum_{\mathbf{x}} p(\mathbf{x};\boldsymbol{\theta})\, \mathbf{f}(\mathbf{x})$.

With a graphical model, what does this mean? Well, notice that the expected value of, say, an indicator function for a particular configuration of clique $c$ will be exactly the marginal probability of that configuration. Thus, the expected value of $\mathbf{f}$ will be a vector containing all univariate and clique-wise marginals. If we write this as $\boldsymbol{\mu}(\boldsymbol{\theta})$, then we have

$\frac{dA}{d\boldsymbol{\theta}} = \boldsymbol{\mu}(\boldsymbol{\theta})$.
The usual story
Suppose we want to do maximum likelihood learning. This means we want to set $\boldsymbol{\theta}$ to maximize

$L(\boldsymbol{\theta}) = \frac{1}{N}\sum_{\hat{\mathbf{x}}} \log p(\hat{\mathbf{x}};\boldsymbol{\theta}) = \frac{1}{N}\sum_{\hat{\mathbf{x}}} \boldsymbol{\theta}\cdot\mathbf{f}(\hat{\mathbf{x}}) - A(\boldsymbol{\theta}).$

If we want to use gradient ascent, we would just take a small step along the gradient. This has a very intuitive form: it is the difference of the expected value of $\mathbf{f}$ under the data to the expected value of $\mathbf{f}$ under the current distribution,

$\frac{dL}{d\boldsymbol{\theta}} = \bar{\mathbf{f}} - \boldsymbol{\mu}(\boldsymbol{\theta}), \quad \text{where} \quad \bar{\mathbf{f}} = \frac{1}{N}\sum_{\hat{\mathbf{x}}} \mathbf{f}(\hat{\mathbf{x}})$,

giving the update

$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \lambda \bigl( \bar{\mathbf{f}} - \boldsymbol{\mu}(\boldsymbol{\theta}) \bigr)$.

Note the lovely property of moment matching here. If we have found a solution, then $\bar{\mathbf{f}} = \boldsymbol{\mu}(\boldsymbol{\theta})$, and so the expected value of $\mathbf{f}$ under the current distribution will be exactly equal to that under the data.

Unfortunately, in a high-treewidth setting, we can’t compute the marginals. That’s too bad. However, we have all these lovely approximate inference algorithms (loopy belief propagation, tree-reweighted belief propagation, mean field, etc.). Suppose we write the resulting approximate marginals as $\tilde{\boldsymbol{\mu}}(\boldsymbol{\theta})$. Then, instead of taking the above gradient step, why not instead just use

$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \lambda \bigl( \bar{\mathbf{f}} - \tilde{\boldsymbol{\mu}}(\boldsymbol{\theta}) \bigr)$?
That’s all fine! However, I often see people say/imply/write some or all of the following:
- This is not guaranteed to converge.
- There is no longer any well-defined objective function being maximized.
- We can’t use line searches.
- We have to use (possibly stochastic) gradient ascent.
- This whole procedure is frightening and shouldn’t be mentioned in polite company.
I agree that we should view this procedure with some suspicion, but it gets far more than it deserves! The first four points, in my view, are simply wrong.
What’s missing
The critical thing that is missing from the above story is this: Approximate marginals come together with an approximate partition function!
That is, if you are computing approximate marginals $\tilde{\boldsymbol{\mu}}(\boldsymbol{\theta})$ using loopy belief propagation, mean-field, or tree-reweighted belief propagation, there is a well-defined approximate log-partition function $\tilde{A}(\boldsymbol{\theta})$ such that

$\frac{d\tilde{A}}{d\boldsymbol{\theta}} = \tilde{\boldsymbol{\mu}}(\boldsymbol{\theta})$.

What this means is that you should think, not of approximating the likelihood gradient, but of approximating the likelihood itself. Specifically, what the above is really doing is optimizing the “surrogate likelihood”

$\tilde{L}(\boldsymbol{\theta}) = \frac{1}{N}\sum_{\hat{\mathbf{x}}} \boldsymbol{\theta}\cdot\mathbf{f}(\hat{\mathbf{x}}) - \tilde{A}(\boldsymbol{\theta}).$

What’s the gradient of this? It is

$\frac{d\tilde{L}}{d\boldsymbol{\theta}} = \bar{\mathbf{f}} - \tilde{\boldsymbol{\mu}}(\boldsymbol{\theta})$

(where $\bar{\mathbf{f}}$ denotes the empirical expectation of $\mathbf{f}$ over the data), or exactly the gradient that was being used above. The advantage of doing things this way is that it is a normal optimization. There is a well-defined objective. It can be plugged into a standard optimization routine, such as BFGS, which will probably be faster than gradient ascent. Line searches guarantee convergence. $\tilde{A}(\boldsymbol{\theta})$ is perfectly tractable to compute. In fact, if you have already computed approximate marginals, it has almost no cost. Life is good.
The only counterargument I can think of is that mean-field and loopy BP can have different local optima, which might mean that a no-line-search, refuse-to-look-at-the-objective-function, just-follow-the-gradient-and-pray style optimization could be more robust, though I’d like to see that argument made…
I’m not sure of the history, but I think part of the reason this procedure has such a bad reputation (even from people that use it!) might be that it predates the “modern” understanding of inference procedures as producing approximate partition functions as well as approximate marginals.
It seems to me that thinking of quotients as a fundamental operator is usually painful and unnecessary when the objects are almost anything other than real (or rational) numbers. Instead it is better to think of a quotient as a combination of the reciprocal and the product. A good example of this is complex numbers. Suppose that

$z = a + bi, \quad w = c + di.$

Then, the usual rule for the quotient is that

$\frac{z}{w} = \frac{ac + bd}{c^2 + d^2} + \frac{bc - ad}{c^2 + d^2}\, i.$

This qualifies as non-memorizable. On the other hand, take the reciprocal of $w$:

$\frac{1}{w} = \frac{c - di}{c^2 + d^2} = \frac{\bar{w}}{|w|^2}.$

This is simple enough (“the complex conjugate divided by the squared norm”), and we recover the rule for the quotient easily enough by multiplying with $z$.
The same thing holds true for derivatives. I’ve never been able to remember the quotient rule from high school; namely that if $f(x) = g(x)/h(x)$, then

$f'(x) = \frac{g'(x)\,h(x) - g(x)\,h'(x)}{h(x)^2}.$

Ick. Instead, better to note that if $f(x) = 1/h(x)$ then

$f'(x) = -\frac{h'(x)}{h(x)^2},$

along with the standard rule for differentiating products, so that if $f(x) = g(x)\cdot\frac{1}{h(x)}$, then

$f'(x) = \frac{g'(x)}{h(x)} - \frac{g(x)\,h'(x)}{h(x)^2}.$
Another case would be the “matrix quotient” $A^{-1}B$. Of course, everyone already thinks of the matrix multiplication and inverse as separate operations (to do otherwise would be horrible), but I think that just proves the point…

(Although, I assume that computing $A^{-1}B$ as a single operation would be more numerically stable than first taking an explicit inverse. This might mean something to people who feel that mathematical notation ought to suggest an obvious stable implementation in IEEE floating point (if any).)
Here is an example of usage of the graphical models toolbox I’ve just released. I’ll use the terminology of “perturbation” to refer to computing loss gradients from the difference of two problems as in this paper, and “truncated fitting” to refer to backpropagating derivatives through the TRW inference process for a fixed number of iterations, as in this paper.
This is basically the simplest possible vision problem. We will train a conditional random field (CRF) to take some noisy image as input:
and produce marginals that well predict a binary image as output:
The noisy image is produced by perturbing each pixel of the binary image by random noise, with the amount controlled by a noise-level parameter, as described in this paper. The setting used here is a pretty challenging amount of noise, as you see above.
The main file, which does the learning described here can be found in the toolbox in examples/train_binary_denoisers.m.
First off, we will train a model using perturbation, with the univariate likelihood loss function, based on TRW inference, with a convergence threshold of 1e-5. We do this by typing:
>> train_binary_denoisers('pert_ul_trw_1e-5')
                                              First-order
Iteration  Func-count      f(x)      Step-size  optimality
0 1 0.692759 0.096
1 3 0.686739 0.432616 0.0225
2 5 0.682054 10 0.0182
3 6 0.670785 1 0.0317
4 10 0.542285 48.8796 0.932
5 12 0.515509 0.1 0.965
6 13 0.439039 1 0.355
7 14 0.302082 1 0.279
8 15 0.228832 1 0.471
9 17 0.223659 0.344464 0.0159
10 18 0.223422 1 0.00417
11 19 0.223231 1 0.00269
12 20 0.223227 1 0.00122
13 22 0.223221 4.33583 0.000857
14 23 0.223201 1 0.00121
15 24 0.223138 1 0.00306
16 25 0.223035 1 0.00509
17 26 0.222903 1 0.00564
18 27 0.222824 1 0.0035
19 28 0.222806 1 0.000945
20 29 0.222803 1 0.000798
21 30 0.222802 1 0.00079
22 31 0.222798 1 0.00111
23 32 0.222788 1 0.00168
24 33 0.222763 1 0.00255
25 34 0.222707 1 0.00364
26 35 0.222603 1 0.00435
27 36 0.222479 1 0.00339
28 37 0.222408 1 0.00117
29 38 0.222394 1 9.64e-05
30 39 0.222393 1 2.05e-05
31 40 0.222393 1 4.06e-06
32 41 0.222393 1 2.86e-07
The final model trained has an error rate of 0.096. We can visualize the marginals produced by making an image where each pixel has an intensity proportional to the predicted probability that that pixel takes label “1”.
On the other hand, we might train a model using truncated fitting, with the univariate likelihood, and 5 iterations of TRW.
>> train_binary_denoisers('trunc_ul_trw_5')
Sparing you the details of the optimization, this yields a total error rate of .0984 and the marginals:
Thus, restricting to only 5 iterations pays only a small accuracy penalty compared to a tight convergence threshold.
Or, we could fit using the surrogate conditional likelihood. (Here called E.M., though we don’t happen to have any hidden variables.)
>> train_binary_denoisers('em_trw_1e-5')
This yields an error rate of .1016, and the marginals:
There are many permutations of loss functions, inference algorithms, etc. (Some of which have not yet been published.) Rather than exhaust all the possibilities, I’ll just list a bunch of examples:
'pert_ul_trw_1e-5' (Perturbation + univariate likelihood + TRW with 1e-5 threshold)
'trunc_cl_trw_5' (Truncated Fitting + clique likelihood + TRW with 5 iterations)
'trunc_cl_mnf_5' (Truncated Fitting + clique likelihood + Mean Field with 5 iterations)
'trunc_em_trw_5' (Truncated EM, with TRW used to compute both upper and lower bounds on partition function + 5 iterations)
'em_trw_1e-5' (Regular EM, with TRW used to compute both upper and lower bounds on partition function + 1e-5 threshold)
'em_trw/mnf_1e-5' (Regular EM, with TRW used for upper bound and mean-field used for lower bound + 1e-5 threshold)
'pseudo' (Pseudolikelihood)
As for the pseudolikelihood, let’s try it.
>> train_binary_denoisers('pseudo')
This yields a near-chance error rate of .44, and the marginals (produced by TRW):
Which is why you probably shouldn’t use the pseudolikelihood…
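For reference, the (standard) pseudolikelihood objective replaces the joint likelihood with a sum of per-variable conditional log-likelihoods, each of which depends only on a variable’s neighbors, so no partition function is needed:

```latex
\mathrm{PL}(\boldsymbol{\theta}) \;=\; \frac{1}{N}\sum_{\hat{\mathbf{x}}} \sum_i \log p\!\left(\hat{x}_i \mid \hat{\mathbf{x}}_{\mathcal{N}(i)}; \boldsymbol{\theta}\right)
```

It is cheap to compute, but, as the error rate above suggests, fitting the conditionals well does not guarantee that the resulting model yields good marginals under approximate inference.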