Julia is a new language in the same arena as Matlab or R. I’ve had failed attempts to quit the Matlab addiction in the past, making me generally quite conservative about new platforms. However, I’ve recently been particularly annoyed by Matlab’s slow speed, evil license manager errors, restrictions on parallel processes, C++ .mex file pain, etc., and so I decided to check it out. It seems inevitable that Matlab will eventually displaced by *something*. The question is: is that something Julia?

“We want a language that’s open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.”

Essentially, the goal seems to be a faster, freer Matlab that treats users like adults (macros!) and *doesn’t require writing any .mex files in C++ or Fortan*. Sounds too good to be true? I decided to try it out. My comparisons are to Matlab (R2012a) and C (gcc 4.2 with -O2 and -fno-builtin to prevent compile-time computations) all on a recent MacBook Pro (2.3 GHz Intel Core i7 w/ 8GB RAM).

Installation was trivial: I just grabbed a pre-compiled binary and started it up.

I deliberately used naive algorithms, since I am just testing raw speed. It should be a fair comparison, as long as the algorithm is constant. Please let me know about any bugs, though.

Click here to skip to the results.

## First benchmark: Fibonnaci

% Matlab function f=fib(n) if n <= 2 f=1.0; else f=fib(n-1)+fib(n-2); end end // C double fib(int n){ if(n<=2) return(1.0); else return(fib(n-2)+fib(n-1)); } % julia function fib(n) if n <= 2 1.0 else fib(n-1)+fib(n-2); end end

Clarity is basically a tie. Running them for n=30 we get:

time in matlab (fib): 14.344231 time in c (fib): 0.005887 time in julia (fib): 0.237832

## Second benchmark: Matrix Multiplication

This is a test of the naive O(N^3) matrix multiplication algorithm.

% matlab function C=mmult(A,B,C) [M,N] = size(A); for i=1:M for j=1:M for k=1:M C(i,j) = C(i,j) + A(i,k)*B(k,j); end end end end // C #define M 500 void mmult(double A[M][M],double B[M][M], double C[M][M]){ //double C[M][M]; int i,j,k; for(i=0; i<M; i++) for(j=0; j<M; j++){ C[i][j] = 0; for(k=0; k<M; k++) C[i][j] += A[i][k]*B[k][j]; } } # julia function mmult(A,B) (M,N) = size(A); C = zeros(M,M); for i=1:M for j=1:M for k=1:M C[i,j] += A[i,k]*B[k,j]; end end end C; end

Here, I think that Matlab and Julia and a bit clearer, and Julia wins though the wonders of having “+=”. The timing results on 500×500 matrices are:

time in matlab (matmult): 1.229571 time in c (matmult): 0.157658 time in julia (matmult): 0.5029549

## Third Benchmark: numerical quadrature

Here, we attempt to calculate the integral by numerical quadrature, using a simple midpoint rule with computations at points.

% matlab function val=numquad(lb,ub,npoints) val = 0.0; for x=lb:(ub-lb)/npoints:ub val = val + sin(x)/npoints; end end // C double numquad(double lb,double ub,int npoints){ double val = 0.0; int i; for(i=0; i<=npoints; i++){ double x = lb + (ub-lb)*i/npoints; val += sin(x)/npoints; } return(val); } # julia function numquad(lb,ub,npoints) val = 0.0 for x=lb:(ub-lb)/npoints:ub val += sin(x)/npoints end val end

The timings are:

time in matlab (numquad): 0.446151 time in c (numquad): 0.167112 time in julia (numquad): 0.256597

## Fourth Benchmark: Belief Propagation

Finally, I decided to try a little algorithm similar to what I actually tend to implement for my research. Roughly speaking, Belief Propagation is a repeated sequence of matrix multiplications, followed by normalization.

% matlab function x=beliefprop(A,x,N) for i=1:N x = A*x; x = x/sum(x); end end // C void beliefprop(double A[25][25], double x[25], int N){ int i, n, j; double x2[25]; for(n=0; n<N; n++){ for(i=0; i<25; i++){ x2[i]=0; for(j=0; j<25; j++) x2[i] += A[i][j]*x[j]; } for(i=0; i<25; i++) x[i]=x2[i]; double mysum = 0; for(i=0; i<25; i++) mysum += x[i]; for(i=0; i<25; i++) x[i] /= mysum; } return; } % julia function beliefprop(A,x,N) for i=1:N x = A*x; x /= sum(x); end x end

Here, I think we can agree that Matlab and Julia are clearer. (Please don’t make fun of me for hardcoding the 25 dimensions in C.) Using a matrix package for C would probably improve clarity, but perhaps also slow things down. The results are:

time in matlab (beliefprop): 0.627478 time in c (beliefprop): 0.074355 time in julia (beliefprop): 0.376427

## Fifth Benchmark: BP in log-space

In practice, Belief Propagation is often implemented in log-space (to help avoid numerical under/over-flow.). To simulate an algorithm like this, I tried changing to propagation to take an exponent before multiplication, and a logarithm before storage.

% matlab function x=beliefprop2(A,x,N) for i=1:N x = log(A*exp(x)); x = x - log(sum(exp(x))); end end // C void beliefprop2(double A[25][25], double x[25], int N){ int i, n, j; double x2[25]; for(n=0; n<N; n++){ for(i=0; i<25; i++){ x2[i]=0; for(j=0; j<25; j++) x2[i] += A[i][j]*exp(x[j]); } for(i=0; i<25; i++) x[i]=log(x2[i]); double mysum = 0; for(i=0; i<25; i++) mysum += exp(x[i]); double mynorm = log(mysum); for(i=0; i<25; i++) x[i] -= mynorm; } return; } # julia function beliefprop2(A,x,N) for i=1:N x = log(A*exp(x)); x -= log(sum(exp(x))); end x end

Life is too short to write C code like that when not necessary. But how about the speed, you ask?

time in matlab (beliefprop2): 0.662761 time in c (beliefprop2): 0.657620 time in julia (beliefprop2): 0.530220

## Sixth Benchmark: Markov Chain Monte Carlo

Here, I implement a simple Metropolis algorithm. For no particular reason, I use the two-dimensional distribution:

% matlab function mcmc(x,N) f = @(x) exp(sin(x(1)*5) - x(1)^2 - x(2)^2); p = f(x); for n=1:N x2 = x + .01*randn(size(x)); p2 = f(x2); if rand < p2/p x = x2; p = p2; end end end // C double f(double *x){ return exp(sin(x[0]*5) - x[0]*x[0] - x[1]*x[1]); } #define pi 3.141592653589793 void mcmc(double *x,int N){ double p = f(x); int n; double x2[2]; for(n=0; n<N; n++){ // run Box_Muller to get 2 normal random variables double U1 = ((double)rand())/RAND_MAX; double U2 = ((double)rand())/RAND_MAX; double R1 = sqrt(-2*log(U1))*cos(2*pi*U2); double R2 = sqrt(-2*log(U1))*sin(2*pi*U2); x2[0] = x[0] + .01*R1; x2[1] = x[1] + .01*R2; double p2 = f(x2); if(((double)rand())/RAND_MAX< p2/p){ x[0] = x2[0]; x[1] = x2[1]; p = p2; } } } % julia function mcmc(x,N) f(x) = exp(sin(x[1]*5) - x[1]^2 - x[2]^2); p = f(x); for n=1:N x2 = x + .01*randn(size(x)); p2 = f(x2); if rand() < p2/p x = x2; p = p2; end end end

Again, I think that C is far less clear than Matlab or Julia. The timings are:

time in matlab (mcmc): 7.747716 time in c (mcmc): 0.150776 time in julia (mcmc): 0.479628

## Table

All times are in seconds. (Lower is better.)

Matlab C Julia fib 14.344 0.005 0.237 matmult 1.229 0.157 0.502 numquad 0.446 0.167 0.256 bp 0.627 0.074 0.376 bp2 0.662 0.657 0.530 mcmc 7.747 0.150 0.479

## Conclusions

I’m sure all these programs can be sped up. In particular, I’d bet that an expert could optimize the C code to beat Julia on `bp2`

and `mcmc`

. These are a test of “how fast can Justin Domke make these programs”, not the intrinsic capabilities of the languages. That said, Julia allows for optional type declarations. I did experiment with these but found absolutely no speed improvement. (Which is a good or a bad thing, depending on how you look at life.)

Another surprise to me was how often Matlab’s JIT managed a speed within a reasonable factor of C. (Except when it didn’t…)

The main thing that at Matlab programmer will miss in Julia is undoubtedly plotting. The Julia designers seem to understand the importance of this (“non-negotiable”). If Julia equalled Matlab’s plotting facilities, Matlab would be in real trouble!

Overall, I think that the killer features of freedom, kinda-sorta-C-like speed, and ease of use make Julia more likely as a Matlab-killer than other projects such as R, Sage, Octave, Scipy, etc. (Not to say that those projects have not succeeded in other ways!) Though Julia’s designers also seem to be targeting current R users, my guess is that they will have more success with Matlab folks in the short term, since most Matlab functionality (other than plotting) already exists, while reproducing R’s statistical libraries will be quite difficult. I also think that Julia would be very attractive to current users of languages like Lush. Just to never write another .mex file, I’ll very seriously consider Julia for new projects. Other benefits such as macros, better parallelism support are just bonuses. As Julia continues to develop, it will become yet more attractive.

There was an interesting discussion on Lambda the Ultimate about Julia back when it was announced

Hi Justin,

Thanks for the interesting article and for introducing ‘Julia’ to me. It sounds very interesting.

Just a quick question. In benchmark 4 in Julia’s code you are multiplying matrix ‘A’ with ‘exp(x)’, but don’t do that in the matlab and C codes. This might have degraded Julia’s performance, if I’m not mistaken. On my machine, with the ‘exp(x)’ calculation, the code is almost twice as slow, compared to just multiplying by ‘x’.

Best regards and keep up the great blog! Thanks

I’ve also been watching Julia, however since you mentioned Lush I figured I’d also throw out Torch7 (the successor to the successor of Lush). It uses Lua as it’s scripting language (which if you use LuaJIT is also near C speed), plus an SSE & CUDA accelerated Tensor library (along with neural nets, etc.). http://www.torch.ch

I’ve only recently started working with it myself, but I like it so far.

Hi Jan,

You’re absolutely right. I re-ran the benchmark, and updated the code and results above. Strangely, it only slightly increased the speed. Thanks for pointing that out!

Hi again. I think if you increase the value of N to some large number, the difference in speed will become more noticeable. My trivial, and maybe silly, example was ‘A = ones(25,25);’ and ‘x=ones(25,1);’ and ‘N=1000000;’. Without ‘exp’ ~1.5s vs. ~3s with ‘exp’.

Cheers

Your C version of f() for MCMC is defined differently from the other two versions. Perhaps this has some impact on the computation time. The performance of Box-Muller method is also inferior to some other normal random number generation methods (such as the “pyramid” method that Matlab uses). I’ve tried Boost.Random library (www.boost.org) and the C++ version was 20% faster.

Also, in bp2, if you use the Eigen2 or Eigen3 libraries (http://eigen.tuxfamily.org), the resulting code would be much more compact and much faster. On my machine, the CPU time is reduced by 90%. Here’s the Eigen 2 code:

#include

#include

#include

void beliefprop2(

const Eigen::MatrixXd & A,

Eigen::VectorXd & x, int N)

{

for (int n=0; n<N; ++n) {

x = (A * x.cwise().exp()).cwise().log();

x.cwise() -= std::log(x.cwise().exp().sum());

}

return;

}

Thanks for the comments. I accidentally posted an older version of the c code for mcmc than I used for running the benchmarks. So, fortunately, the current (corrected) code corresponds to the original timing.

I am actually a big fan of Eigen, and have been using it to write matlab .mex functions for a few years now. A reduction by 90% is pretty amazing, though. Do you have any insight into if it is the matrix operations, or the exp/log that is getting sped up so much?

Hmm, I get quite different results from you for some of your benchmarks. For example, while you show that Julia is 5x slower than C for bp, for Julia is faster than your C version:

For N = 10000:

Julia: 0.028-0.054s (typically around 0.035s)

C: 0.046-0.053s

The fluctuations in Julia are due to the garbage collector sometimes being called; if you execute gc() before timing your function you get consistent timing between 0.028-0.029, almost 2x faster than the C version. Presumably, this is due to Julia’s using BLAS libraries. In any event, this is not at all consistent with Julia being 5x slower than C on this benchmark.

Are you sure you were measuring the same thing (wall clock time) with both of them? Did you make sure to let Julia run it multiple times without reloading the file? (The first execution will be slow because Julia has to JIT-compile the function.)

If all those things are good, your precompiled binary may not be up to the performance you’d get from a fresh git checkout.

Interesting. With Julia, I get around .04s for N=10000 and around .45s for N=100000, which pretty much fits with your numbers. So, it seems that the question is why C is so slow for you. Did you compile with optimization? (I get .02s for N=10000 with no optimization, and .007s with -O2)

It certainly *could* be that I’m measuring something incorrectly, though I don’t think so. For Julia I do something like this:

tic()

x=beliefprop(A,x,100000);

println(“time in julia (beliefprop): “,toc())

While in C I do something like:

begin = clock();

beliefprop(A2,x,10000);

end = clock();

time_spent = (double)(end – begin) / CLOCKS_PER_SEC;

printf(“time in c (beliefprop): %f \n”, time_spent);

Ah, I forgot -O2 (dumb mistake). That drops the C time a lot, to the point where it wins (~0.02s vs ~0.03s), but it’s still a much closer race between Julia and C than you get.

I’m timing Julia code this way:

@time y = beliefprop(A, x, 10000);

or

gc(); @time y = beliefprop(A, x, 10000);

if you want Julia to start with a clean slate on each run.

Hmm, in the way you time the Julia code, there’s the potential for other things besides the function you’re testing to eat up time before that toc() call. Your C version does it more carefully, which may be another source of disparity. Consistent with this, you may notice that the timing of fib() as listed on the Julia website is far more competitive with C than you are showing, despite the fact that your fib() is basically the same as theirs. I wonder if this comes down to differences in how you measure the timing.

I also noticed that the mmult code includes allocation in the timing for Julia (that “zeros” call allocates the memory and fills it with zeros), but not for any of the others.

As a final tip, if you really want the maximum performance from Julia, it’s possible to do things like this:

A_mul_B(result, A, B)

and it will store the result of A*B in the pre-allocated array “result”. If you wanted to, you could use this to make your Julia beliefprop behave more like your C version (using two pre-allocated arrays to avoid the overhead of allocating temporaries), at some cost in readability.

And one more point…your C timing measures CPU time, your Julia and Matlab versions measure wall clock time. I used gettimeofday to make sure the C version was recording wall clock time, too.

OK! Well, I tried installing Julia from source, which didn’t give any speedup. I also changed the C code to measure times with a harness like:

struct timeval t0;

struct timeval t1;

gettimeofday(&t0, 0);

beliefprop2(A2,x,100000);

gettimeofday(&t1, 0);

printf(“time in c (beliefprop2): %f \n”, t1.tv_sec-t0.tv_sec + (t1.tv_usec-t0.tv_usec)/1000000.0);

and in Julia with something like:

gc()

tic()

mmult(A,B,C)

println(“time in julia (matmult): “,toc());

I tested doing this toc() before the print, but this made no difference. I updated all the numbers above. The call to gc() did reliably speed up almost everything in Julia. However, the times are also now faster for C, meaning overall Julia is looking somewhat worse.

I do realize that my results for Fibonacci are totally different from those on the Julia homepage. If anyone could explain why, I’d be very interested to know.

For fib: there are two main differences between yours and the one here:

https://github.com/JuliaLang/julia/blob/master/test/perf/perf.jl

First, a one-line function definition gets inlined in Julia, whereas a multi-line definition (currently) doesn’t. Second, you may have a type problem: 1.0 is a floating-point type, but your inputs are integers, and that means that type conversions have to take place. The syntax “one(T)” construct 1 in type T.

For bp: I also remembered that Julia’s “sum” currently emphasizes accuracy over speed, and uses the KBN algorithm (http://en.wikipedia.org/wiki/Kahan_summation_algorithm) to correct for roundoff error. Whether to emphasize speed or accuracy has been discussed repeatedly, but there does not seem to be an easy answer, and the current version favors accuracy. Part of the reason is that, if you want speed, you can trivially write the fast version yourself. Combining it with the (nearly) “allocation-free” version, it might look like this:

function beliefprop_noalloc(A,x,N)

x = copy(x) # don’t mutate x

y = similar(x) # use the same temporary each time

for i=1:N

A_mul_B(y, A, x)

s = fastsum(y)

for j = 1:length(x)

x[j] = y[j]/s

end

end

x

end

function fastsum{T}(x::Array{T})

s = zero(T)

for i = 1:length(x)

s += x[i]

end

s

end

My timing results for N=100000:

Julia “beliefprop” (your version) 0.33s

Julia “beliefprop_noalloc” 0.157s

C (remembering -O2) 0.094s

There’s less than a factor of two separating the faster Julia version and C, on my machine. (In this case, I think that almost the only thing left between Julia’s and C’s performance is that Julia does automatic bounds-checking on each element access.)

Naturally this is less readable than your original. But in a way that’s the nice thing about Julia: if you really want to go head-to-head with C, you can usually do that pretty well. If instead you want readability, you can get it, and the performance is still quite decent.

Actually, I tried alternative definitions of fib such as taking exactly the definition used for the Julia benchmarks:

fib(n) = n < 2 ? n : fib(n-1) + fib(n-2)

And I found this, oddly, to actually be slightly slower than my version. I also experimented with adding typing information to the function like "function fib(n::Int)", but this makes zero difference since Julia is smart enough to infer those types anyway.

Based on implementing a "real" MCMC algorithm for inference on an MRF, I totally agree with your point about Julia being flexible on the readability/performance tradeoff. The way things "should" work is you start with an easy version to get it right, then optionally add typing information, and optimize from there. Even if you just end up with "C in Julia", it is a far more pleasant process than actually rewriting the entire function in C in one go. (Which I've done more times than I like to think about with Matlab.)

Hmm, that is odd. Well, I don’t understand it, but clearly you’ve tried quite a few permutations.

In turn, I completely agree with your point about rewriting Matlab in C! That’s something I too have done too many times.

Seems interesting, I had never head of Julia before.

I’m all for replacing Matlab with other tools. I was a big Matlab user myself a couple of years ago. Eventually I started using Octave and Gnuplot. Today I do all my work using Python (with numpy, scipy and matplotlib), and if I want to optimize something I use Cython and C.

I don’t think this is a reasonable benchmark. Experienced matlab uses will never multiply matrix with for loops. Stackoverflow(http://stackoverflow.com/questions/6058139/why-is-matlab-so-fast-in-matrix-multiplication) has a post on how fast the matlab is in matrix multiplication.

for x=lb:(ub-lb)/npoints:ub

val = val + sin(x)/npoints;

end

It doesn’t make sense to do it that way in Matlab. You should use something like this so that it runs at the optimum speed:

val = sum(sin(lb : (ub – lb) / npoints : ub)) / (npoints + 1)

I think you’ll be pleasantly surprised with the speed of Matlab if you follow the advice in this article:

http://web.cecs.pdx.edu/~gerry/MATLAB/programming/performance.html

Hi guys,

The benchmarks are naive, but the point is that they are all naive in the different languages in the same way. Matrix multiplication is intended as a surrogate for more complex problems where each language doesn’t have a wrapper to super-optimized Fortan code. Similarly, the non-vectorized Matlab is intended as a surrogate to more complex cases where the problem is difficult or impossible to vectorize. The reason the simple benchmarks are used here is because they are… simple! A more complex example is always going to be less transparent about what exactly is causing the different performance.

That doesn’t make sense. MATLAB has a huge library to enable vectorizations. In fact, there are many ways to simply avoid loops altogether, for example via BSXFUN.

If the claim is that any program can be vectorized… well… I disagree with that claim.

The number of operations that cannot be vectorized efficiently in Matlab is mind-blowing. Have you ever tried:

– To implement a heap algorithm

– To compute the determinant of 10^8 2-by-2 matrices

– To find the smallest integer bigger than n that can be expressed as a product of powers of 2, 3, 5, and 7

– To find certain items in a huge on-disk array without having to (or being able to) load the whole thing into memory at once

– (prior to accumarray): Given n items each assigned to one of m clusters, collect the list of items in each cluster in less than m*n time

etc, etc, etc

The accumarray example illustrates the main point: most of the “interesting” parts of Matlab are written in C. If you do interesting things that no one has yet written in C, you’re doomed to write endless MEX files.

> The main thing that at Matlab programmer will miss in Julia is undoubtedly plotting.

You can now (May 2013) import Python modules and call them from Julia with the PyCall module, using Julia’s Array type. Because Python has very extensive, high-quality plotting functions with the matplotlib library, this is probably sufficient for 99 % of Matlab users. Also, Python provides bindings to

Links:

https://github.com/stevengj/PyCall.jl

http://matplotlib.org/gallery.html

Numerous other possibilities to plot from Python are listed here:

http://wiki.python.org/moin/NumericAndScientific/Plotting

Hi,

I’m using Matlab with

feature accel on

and fib with for loop:

function n=fib_for(n)

f=[1 1];

if n > 2

for i=1:n-2

f_n=f(1)+f(2);

f=[f(2) f_n];

end

end

n=f(2);

tic, fib_for(30), toc

ans =

832040

Elapsed time is 0.000370 seconds.

and using your function:

tic, fib_nest(30), toc

ans =

832040

Elapsed time is 10.050176 seconds.

Anyway, nice post!

It’s poor form to bring a dynamic programming algorithm to a naive benchmarking party!

I’m just a naive user 🙂 I’d never come up on my own with the fib function you’ve used, and just wanted to compare it with the ‘normal’ way I’m using Matlab.

And now I’ve learned that I’m using dynamic programming 🙂

You say “Another surprise to me was how often Matlab’s JIT managed a speed within a reasonable factor of C”

I assume you mean “Julia’s JIT” here.

Actually, I did mean Matlab. I guess I had unrealistically low expectations for it?

A few comments here:

1) Matlab JIT has seen significant improvement since 2012a. Running the fib code now produces a benchmark of 0.0469s in R2017a on a i7 3.4GHz machine.

2) Recursion is notoriously slow unless the language is designed and optimised for it. All recursion can be re-written as a for/while loop. Compare the following (still easy to implement and read):

function f=fib(N)

if N<3

f = 1;

return;

end

f = [1 1];

for n=3:N

f = [f(2) sum(f)];

end

f = f(2);

end

gives a benchmark of <23us (microseconds), i.e. ~2000 times faster.

Hi Adam, I’m aware that I was using a naive algorithm for fibonacci, because I was interested in how fast (implementations of) *languages* were, rather than different algorithms. So I don’t see much value in switching out fibonacci to a faster algorithm (yours is basically the same as Piotr’s from back in 2014). It’s worthwhile to see how fast the stupid algorithm is since it’s representative of a class of problems where there is no dynamic programming alternative.

It’s an interesting point that all recursion can be re-written as a for/while loop. In the general case, won’t I essentially end up implementing a stack of arguments that’s isomorphic to the function call stack with recursion? That’s true but painful. Languages are here to serve use and recursion is convenient, so I want it to be fast!