Julia, Matlab, and C

Julia is a new language in the same arena as Matlab or R. I’ve had failed attempts to quit the Matlab addiction in the past, which have left me generally quite conservative about new platforms. Recently, however, I’ve been particularly annoyed by Matlab’s slow speed, evil license manager errors, restrictions on parallel processes, C++ .mex file pain, etc., and so I decided to check it out. It seems inevitable that Matlab will eventually be displaced by something. The question is: is that something Julia?

From the Julia developers’ announcement, “Why We Created Julia”:

“We want a language that’s open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.”

Essentially, the goal seems to be a faster, freer Matlab that treats users like adults (macros!) and doesn’t require writing any .mex files in C++ or Fortran. Sounds too good to be true? I decided to try it out. My comparisons are to Matlab (R2012a) and C (gcc 4.2 with -O2 and -fno-builtin to prevent compile-time computations), all on a recent MacBook Pro (2.3 GHz Intel Core i7 with 8GB RAM).

Installation was trivial: I just grabbed a pre-compiled binary and started it up.

I deliberately used naive algorithms, since I am just testing raw speed. The comparison should be fair as long as the same algorithm is used in every language. Please let me know about any bugs, though.
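For reference, the timings below are wall-clock times collected with a bare-bones harness. Here is a sketch of the Julia version (the C version wraps the call in clock()/gettimeofday similarly, as shown in the comments below):

# julia (sketch of the timing harness)
gc()                                     # collect garbage first, for consistent timings
tic()                                    # start the wall-clock timer
fib(30)                                  # the call being benchmarked
println("time in julia (fib): ", toc())  # toc() returns elapsed seconds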


First Benchmark: Fibonacci

% matlab
function f=fib(n)
  if n <= 2
    f=1.0;
  else
    f=fib(n-1)+fib(n-2);
  end
end

// C
double fib(int n){
  if(n<=2)
    return(1.0);
  else
    return(fib(n-2)+fib(n-1));
}

# julia
function fib(n)
  if n <= 2
    1.0
  else
    fib(n-1)+fib(n-2);
  end
end

Clarity is basically a tie. Running them for n=30, we get:

time in matlab (fib): 14.344231
time in c (fib):       0.005887
time in julia (fib):   0.237832

Second Benchmark: Matrix Multiplication

This is a test of the naive O(N^3) matrix multiplication algorithm.
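In symbols, each entry is computed directly, with no blocking and no calls to an optimized BLAS:

C(i,j) = \sum_{k=1}^{N} A(i,k) B(k,j)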

% matlab
function C=mmult(A,B,C)
  [M,N] = size(A);
  for i=1:M
    for j=1:M
      for k=1:M
        C(i,j) = C(i,j) + A(i,k)*B(k,j);
      end
    end
  end
end

// C
#define M 500
void mmult(double A[M][M], double B[M][M], double C[M][M]){
  int i,j,k;
  for(i=0; i<M; i++)
    for(j=0; j<M; j++){
      C[i][j] = 0;
      for(k=0; k<M; k++)
        C[i][j] += A[i][k]*B[k][j];
    }
}

# julia
function mmult(A,B)
  (M,N) = size(A);
  C = zeros(M,M);
  for i=1:M
    for j=1:M
      for k=1:M
        C[i,j] += A[i,k]*B[k,j];
      end
    end
  end
  C;
end

Here, I think that Matlab and Julia are a bit clearer, and Julia wins through the wonders of having “+=”. The timing results on 500×500 matrices are:

time in matlab (matmult): 1.229571
time in c (matmult):      0.157658 
time in julia (matmult):  0.5029549

Third Benchmark: Numerical Quadrature

Here, we attempt to calculate the integral \int_{x=5}^{15} \sin(x) \, dx by numerical quadrature, using a simple rectangle rule evaluated at 10^7 points.
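Concretely, all three implementations evaluate the same sum over the grid points x_i = lb + i(ub-lb)/n:

val = \sum_{i=0}^{n} \sin(x_i)/n

Up to the constant width factor (ub-lb), this is the rectangle-rule approximation of the integral; since that factor is dropped identically in all three languages, the timing comparison is unaffected.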

% matlab
function val=numquad(lb,ub,npoints)
  val = 0.0;
  for x=lb:(ub-lb)/npoints:ub
    val = val + sin(x)/npoints;
  end
end

// C
double numquad(double lb,double ub,int npoints){
  double val = 0.0;
  int i;
  for(i=0; i<=npoints; i++){
    double x = lb + (ub-lb)*i/npoints;
    val += sin(x)/npoints;
  }
  return(val);
}

# julia
function numquad(lb,ub,npoints)
  val = 0.0
  for x=lb:(ub-lb)/npoints:ub
    val += sin(x)/npoints
  end
  val
end

The timings are:

time in matlab (numquad): 0.446151
time in c (numquad):      0.167112 
time in julia (numquad):  0.256597

Fourth Benchmark: Belief Propagation

Next, I decided to try a little algorithm similar to what I actually tend to implement for my research. Roughly speaking, Belief Propagation is a repeated sequence of matrix multiplications, followed by normalization.
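In symbols, each of the N iterations computes

x \leftarrow A x, \qquad x \leftarrow x / \sum_i x_i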

% matlab
function x=beliefprop(A,x,N)
  for i=1:N
    x = A*x;
    x = x/sum(x);
  end
end

// C
void beliefprop(double A[25][25], double x[25], int N){
  int i, n, j;
  double x2[25];
  for(n=0; n<N; n++){
    for(i=0; i<25; i++){
      x2[i]=0;
      for(j=0; j<25; j++)
	x2[i] += A[i][j]*x[j];
    }   
    for(i=0; i<25; i++)
      x[i]=x2[i];
    double mysum = 0;
    for(i=0; i<25; i++)
      mysum += x[i];
    for(i=0; i<25; i++)
      x[i] /= mysum;
  }
  return;
}

# julia
function beliefprop(A,x,N)
  for i=1:N
    x = A*x;
    x /= sum(x);
  end
  x
end

Here, I think we can agree that Matlab and Julia are clearer. (Please don’t make fun of me for hardcoding the 25 dimensions in C.) Using a matrix package for C would probably improve clarity, but perhaps also slow things down. The results are:

time in matlab (beliefprop): 0.627478
time in c (beliefprop):      0.074355 
time in julia (beliefprop):  0.376427

Fifth Benchmark: BP in log-space

In practice, Belief Propagation is often implemented in log-space (to help avoid numerical under/overflow). To simulate an algorithm like this, I tried changing the propagation to take an exponent before multiplication and a logarithm before storage.
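That is, with exp and log applied elementwise, each iteration now computes

x \leftarrow \log(A \exp(x)), \qquad x \leftarrow x - \log \sum_i \exp(x_i)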

% matlab
function x=beliefprop2(A,x,N)
  for i=1:N
    x = log(A*exp(x));
    x = x - log(sum(exp(x)));
  end
end

// C
void beliefprop2(double A[25][25], double x[25], int N){
  int i, n, j;
  double x2[25];
  for(n=0; n<N; n++){
    for(i=0; i<25; i++){
      x2[i]=0;
      for(j=0; j<25; j++)
	x2[i] += A[i][j]*exp(x[j]);
    }   
    for(i=0; i<25; i++)
      x[i]=log(x2[i]);
    double mysum = 0;
    for(i=0; i<25; i++)
      mysum += exp(x[i]);
    double mynorm = log(mysum);
    for(i=0; i<25; i++)
      x[i] -= mynorm;
  }
  return;
}

# julia
function beliefprop2(A,x,N)
  for i=1:N
    x = log(A*exp(x));
    x -= log(sum(exp(x)));
  end
  x
end

Life is too short to write C code like that when not necessary. But how about the speed, you ask?

time in matlab (beliefprop2): 0.662761
time in c (beliefprop2):      0.657620 
time in julia (beliefprop2):  0.530220

Sixth Benchmark: Markov Chain Monte Carlo

Here, I implement a simple Metropolis algorithm. For no particular reason, I use the two-dimensional distribution:

p(x) \propto \exp(\sin(5 x_1) - x_1^2 - x_2^2)
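Each step of the Metropolis algorithm proposes x2 = x + 0.01\epsilon with \epsilon drawn from a standard normal, and accepts the move with probability \min(1, p(x2)/p(x)). Since only the ratio p(x2)/p(x) appears, the unnormalized density is all we need.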

% matlab
function mcmc(x,N)
  f = @(x) exp(sin(x(1)*5) - x(1)^2 - x(2)^2);
  p = f(x);
  for n=1:N
    x2 = x + .01*randn(size(x));
    p2 = f(x2);
    if rand < p2/p
      x = x2;
      p = p2;
    end
  end
end

// C
double f(double *x){
  return exp(sin(x[0]*5) - x[0]*x[0] - x[1]*x[1]);
}
#define pi 3.141592653589793
void mcmc(double *x,int N){
  double p = f(x);
  int n;
  double x2[2];
  for(n=0; n<N; n++){
    // run Box-Muller to get 2 normal random variables
    double U1 = ((double)rand())/RAND_MAX;
    double U2 = ((double)rand())/RAND_MAX;
    double R1 = sqrt(-2*log(U1))*cos(2*pi*U2);
    double R2 = sqrt(-2*log(U1))*sin(2*pi*U2);
    x2[0] = x[0] + .01*R1;
    x2[1] = x[1] + .01*R2;
    double p2 = f(x2);
    if(((double)rand())/RAND_MAX < p2/p){
      x[0] = x2[0];
      x[1] = x2[1];
      p = p2;
    }
  }
}

# julia
function mcmc(x,N)
  f(x) = exp(sin(x[1]*5) - x[1]^2 - x[2]^2);
  p = f(x);
  for n=1:N
    x2 = x + .01*randn(size(x));
    p2 = f(x2);
    if rand() < p2/p
      x = x2;
      p = p2;
    end
  end
end

Again, I think that C is far less clear than Matlab or Julia. The timings are:

time in matlab (mcmc): 7.747716
time in c (mcmc):      0.150776 
time in julia (mcmc):  0.479628

Table

All times are in seconds. (Lower is better.)

           Matlab     C     Julia
fib        14.344   0.005   0.237
matmult     1.229   0.157   0.502
numquad     0.446   0.167   0.256
bp          0.627   0.074   0.376
bp2         0.662   0.657   0.530
mcmc        7.747   0.150   0.479

Conclusions

I’m sure all these programs can be sped up. In particular, I’d bet that an expert could optimize the C code to beat Julia on bp2 and mcmc. These are tests of “how fast can Justin Domke make these programs”, not of the intrinsic capabilities of the languages. That said, Julia allows for optional type declarations. I did experiment with these but found absolutely no speed improvement. (Which is a good or a bad thing, depending on how you look at life.)
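For example, here is the sort of typed variant of fib I tried; it timed the same as the untyped version, since Julia infers the argument type anyway:

# julia (typed variant; no faster than the untyped version)
function fib(n::Int)
  if n <= 2
    1.0
  else
    fib(n-1)+fib(n-2)
  end
end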

Another surprise to me was how often Matlab’s JIT managed a speed within a reasonable factor of C. (Except when it didn’t…)

The main thing that a Matlab programmer will miss in Julia is undoubtedly plotting. The Julia designers seem to understand the importance of this (“non-negotiable”). If Julia equalled Matlab’s plotting facilities, Matlab would be in real trouble!

Overall, I think that the killer features of freedom, kinda-sorta-C-like speed, and ease of use make Julia a more likely Matlab-killer than other projects such as R, Sage, Octave, and Scipy. (Not to say that those projects have not succeeded in other ways!) Though Julia’s designers also seem to be targeting current R users, my guess is that they will have more success with Matlab folks in the short term, since most Matlab functionality (other than plotting) already exists in Julia, while reproducing R’s statistical libraries will be quite difficult. I also think that Julia would be very attractive to current users of languages like Lush. Just to never write another .mex file, I’ll very seriously consider Julia for new projects. Other benefits, such as macros and better parallelism support, are just bonuses. As Julia continues to develop, it will become yet more attractive.

There was an interesting discussion on Lambda the Ultimate about Julia back when it was announced.


29 Responses to Julia, Matlab, and C

  1. Jan says:

    Hi Justin,
    Thanks for the interesting article and for introducing ‘Julia’ to me. It sounds very interesting.
    Just a quick question. In benchmark 4 in Julia’s code you are multiplying matrix ‘A’ with ‘exp(x)’, but don’t do that in the matlab and C codes. This might have degraded Julia’s performance, if I’m not mistaken. On my machine, with the ‘exp(x)’ calculation, the code is almost twice as slow, compared to just multiplying by ‘x’.
    Best regards and keep up the great blog! Thanks

  2. df says:

    I’ve also been watching Julia, however, since you mentioned Lush I figured I’d also throw out Torch7 (the successor to the successor of Lush). It uses Lua as its scripting language (which, if you use LuaJIT, is also near C speed), plus an SSE & CUDA accelerated Tensor library (along with neural nets, etc.). http://www.torch.ch

    I’ve only recently started working with it myself, but I like it so far.

  3. justindomke says:

    Hi Jan,

    You’re absolutely right. I re-ran the benchmark, and updated the code and results above. Strangely, it only slightly increased the speed. Thanks for pointing that out!

  4. Jan says:

    Hi again. I think if you increase the value of N to some large number, the difference in speed will become more noticeable. My trivial, and maybe silly, example was ‘A = ones(25,25);’ and ‘x=ones(25,1);’ and ‘N=1000000;’. Without ‘exp’ ~1.5s vs. ~3s with ‘exp’.
    Cheers

  5. The suffocated says:

    Your C version of f() for MCMC is defined differently from the other two versions. Perhaps this has some impact on the computation time. The performance of the Box-Muller method is also inferior to some other normal random number generation methods (such as the “pyramid” method that Matlab uses). I’ve tried the Boost.Random library (www.boost.org) and the C++ version was 20% faster.

  6. The suffocated says:

    Also, in bp2, if you use the Eigen2 or Eigen3 libraries (http://eigen.tuxfamily.org), the resulting code would be much more compact and much faster. On my machine, the CPU time is reduced by 90%. Here’s the Eigen 2 code:

    #include <Eigen/Core>
    #include <Eigen/Array>
    #include <cmath>

    void beliefprop2(const Eigen::MatrixXd & A,
                     Eigen::VectorXd & x, int N)
    {
      for (int n=0; n<N; ++n) {
        x = (A * x.cwise().exp()).cwise().log();
        x.cwise() -= std::log(x.cwise().exp().sum());
      }
      return;
    }

  7. justindomke says:

    Thanks for the comments. I accidentally posted an older version of the c code for mcmc than I used for running the benchmarks. So, fortunately, the current (corrected) code corresponds to the original timing.

    I am actually a big fan of Eigen, and have been using it to write matlab .mex functions for a few years now. A reduction by 90% is pretty amazing, though. Do you have any insight into if it is the matrix operations, or the exp/log that is getting sped up so much?

  8. Tim Holy says:

    Hmm, I get quite different results from you for some of your benchmarks. For example, while you show that Julia is 5x slower than C for bp, for me Julia is faster than your C version:

    For N = 10000:
    Julia: 0.028-0.054s (typically around 0.035s)
    C: 0.046-0.053s

    The fluctuations in Julia are due to the garbage collector sometimes being called; if you execute gc() before timing your function you get consistent timing between 0.028-0.029, almost 2x faster than the C version. Presumably, this is due to Julia’s using BLAS libraries. In any event, this is not at all consistent with Julia being 5x slower than C on this benchmark.

    Are you sure you were measuring the same thing (wall clock time) with both of them? Did you make sure to let Julia run it multiple times without reloading the file? (The first execution will be slow because Julia has to JIT-compile the function.)

    If all those things are good, your precompiled binary may not be up to the performance you’d get from a fresh git checkout.

  9. justindomke says:

    Interesting. With Julia, I get around .04s for N=10000 and around .45s for N=100000, which pretty much fits with your numbers. So, it seems that the question is why C is so slow for you. Did you compile with optimization? (I get .02s for N=10000 with no optimization, and .007s with -O2)

    It certainly *could* be that I’m measuring something incorrectly, though I don’t think so. For Julia I do something like this:

    tic()
    x=beliefprop(A,x,100000);
    println("time in julia (beliefprop): ",toc())

    While in C I do something like:

    begin = clock();
    beliefprop(A2,x,10000);
    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("time in c (beliefprop): %f \n", time_spent);

  10. Tim Holy says:

    Ah, I forgot -O2 (dumb mistake). That drops the C time a lot, to the point where it wins (~0.02s vs ~0.03s), but it’s still a much closer race between Julia and C than you get.

    I’m timing Julia code this way:
    @time y = beliefprop(A, x, 10000);

    or

    gc(); @time y = beliefprop(A, x, 10000);
    if you want Julia to start with a clean slate on each run.

  11. Tim Holy says:

    Hmm, in the way you time the Julia code, there’s the potential for other things besides the function you’re testing to eat up time before that toc() call. Your C version does it more carefully, which may be another source of disparity. Consistent with this, you may notice that the timing of fib() as listed on the Julia website is far more competitive with C than you are showing, despite the fact that your fib() is basically the same as theirs. I wonder if this comes down to differences in how you measure the timing.

    I also noticed that the mmult code includes allocation in the timing for Julia (that “zeros” call allocates the memory and fills it with zeros), but not for any of the others.

    As a final tip, if you really want the maximum performance from Julia, it’s possible to do things like this:
    A_mul_B(result, A, B)
    and it will store the result of A*B in the pre-allocated array “result”. If you wanted to, you could use this to make your Julia beliefprop behave more like your C version (using two pre-allocated arrays to avoid the overhead of allocating temporaries), at some cost in readability.

  12. Tim Holy says:

    And one more point…your C timing measures CPU time, your Julia and Matlab versions measure wall clock time. I used gettimeofday to make sure the C version was recording wall clock time, too.

  13. justindomke says:

    OK! Well, I tried installing Julia from source, which didn’t give any speedup. I also changed the C code to measure times with a harness like:

    struct timeval t0;
    struct timeval t1;
    gettimeofday(&t0, 0);
    beliefprop2(A2,x,100000);
    gettimeofday(&t1, 0);
    printf("time in c (beliefprop2): %f \n", t1.tv_sec-t0.tv_sec + (t1.tv_usec-t0.tv_usec)/1000000.0);

    and in Julia with something like:

    gc()
    tic()
    mmult(A,B,C)
    println("time in julia (matmult): ",toc());

    I tested doing this toc() before the print, but this made no difference. I updated all the numbers above. The call to gc() did reliably speed up almost everything in Julia. However, the times are also now faster for C, meaning overall Julia is looking somewhat worse.

    I do realize that my results for Fibonacci are totally different from those on the Julia homepage. If anyone could explain why, I’d be very interested to know.

  14. Tim Holy says:

    For fib: there are two main differences between yours and the one here:

    https://github.com/JuliaLang/julia/blob/master/test/perf/perf.jl

    First, a one-line function definition gets inlined in Julia, whereas a multi-line definition (currently) doesn’t. Second, you may have a type problem: 1.0 is a floating-point type, but your inputs are integers, and that means that type conversions have to take place. The syntax “one(T)” constructs 1 in type T.

    For bp: I also remembered that Julia’s “sum” currently emphasizes accuracy over speed, and uses the KBN algorithm (http://en.wikipedia.org/wiki/Kahan_summation_algorithm) to correct for roundoff error. Whether to emphasize speed or accuracy has been discussed repeatedly, but there does not seem to be an easy answer, and the current version favors accuracy. Part of the reason is that, if you want speed, you can trivially write the fast version yourself. Combining it with the (nearly) “allocation-free” version, it might look like this:

    function beliefprop_noalloc(A,x,N)
      x = copy(x)    # don't mutate x
      y = similar(x) # use the same temporary each time
      for i=1:N
        A_mul_B(y, A, x)
        s = fastsum(y)
        for j = 1:length(x)
          x[j] = y[j]/s
        end
      end
      x
    end

    function fastsum{T}(x::Array{T})
      s = zero(T)
      for i = 1:length(x)
        s += x[i]
      end
      s
    end

    My timing results for N=100000:
    Julia “beliefprop” (your version) 0.33s
    Julia “beliefprop_noalloc” 0.157s
    C (remembering -O2) 0.094s
    There’s less than a factor of two separating the faster Julia version and C, on my machine. (In this case, I think that almost the only thing left between Julia’s and C’s performance is that Julia does automatic bounds-checking on each element access.)

    Naturally this is less readable than your original. But in a way that’s the nice thing about Julia: if you really want to go head-to-head with C, you can usually do that pretty well. If instead you want readability, you can get it, and the performance is still quite decent.

  15. justindomke says:

    Actually, I tried alternative definitions of fib such as taking exactly the definition used for the Julia benchmarks:

    fib(n) = n < 2 ? n : fib(n-1) + fib(n-2)

    And I found this, oddly, to actually be slightly slower than my version. I also experimented with adding typing information to the function like "function fib(n::Int)", but this makes zero difference since Julia is smart enough to infer those types anyway.

    Based on implementing a "real" MCMC algorithm for inference on an MRF, I totally agree with your point about Julia being flexible on the readability/performance tradeoff. The way things "should" work is you start with an easy version to get it right, then optionally add typing information, and optimize from there. Even if you just end up with "C in Julia", it is a far more pleasant process than actually rewriting the entire function in C in one go. (Which I've done more times than I like to think about with Matlab.)

  16. Tim Holy says:

    Hmm, that is odd. Well, I don’t understand it, but clearly you’ve tried quite a few permutations.

    In turn, I completely agree with your point about rewriting Matlab in C! That’s something I too have done too many times.

  17. nlw0 says:

    Seems interesting; I had never heard of Julia before.

    I’m all for replacing Matlab with other tools. I was a big Matlab user myself a couple of years ago. Eventually I started using Octave and Gnuplot. Today I do all my work using Python (with numpy, scipy and matplotlib), and if I want to optimize something I use Cython and C.

  18. seckcoder says:

    I don’t think this is a reasonable benchmark. Experienced Matlab users will never multiply matrices with for loops. Stack Overflow (http://stackoverflow.com/questions/6058139/why-is-matlab-so-fast-in-matrix-multiplication) has a post on how fast Matlab is at matrix multiplication.

  19. kintaar says:

    for x=lb:(ub-lb)/npoints:ub
      val = val + sin(x)/npoints;
    end

    It doesn’t make sense to do it that way in Matlab. You should use something like this so that it runs at the optimum speed:

    val = sum(sin(lb : (ub - lb) / npoints : ub)) / (npoints + 1)

    I think you’ll be pleasantly surprised with the speed of Matlab if you follow the advice in this article:

    http://web.cecs.pdx.edu/~gerry/MATLAB/programming/performance.html

  20. justindomke says:

    Hi guys,

    The benchmarks are naive, but the point is that they are all naive in the different languages in the same way. Matrix multiplication is intended as a surrogate for more complex problems where a language doesn’t have a wrapper to super-optimized Fortran code. Similarly, the non-vectorized Matlab is intended as a surrogate for more complex cases where the problem is difficult or impossible to vectorize. The reason simple benchmarks are used here is that they are… simple! A more complex example is always going to be less transparent about what exactly is causing the different performance.

  21. Anonymous says:

    That doesn’t make sense. MATLAB has a huge library to enable vectorization. In fact, there are many ways to simply avoid loops altogether, for example via BSXFUN.

  22. justindomke says:

    If the claim is that any program can be vectorized… well… I disagree with that claim.

  23. Tim Holy says:

    The number of operations that cannot be vectorized efficiently in Matlab is mind-blowing. Have you ever tried:
    – To implement a heap algorithm
    – To compute the determinant of 10^8 2-by-2 matrices
    – To find the smallest integer bigger than n that can be expressed as a product of powers of 2, 3, 5, and 7
    – To find certain items in a huge on-disk array without having to (or being able to) load the whole thing into memory at once
    – (prior to accumarray): Given n items each assigned to one of m clusters, collect the list of items in each cluster in less than m*n time
    etc, etc, etc

    The accumarray example illustrates the main point: most of the “interesting” parts of Matlab are written in C. If you do interesting things that no one has yet written in C, you’re doomed to write endless MEX files.

  24. Johannes says:

    > The main thing that a Matlab programmer will miss in Julia is undoubtedly plotting.

    You can now (May 2013) import Python modules and call them from Julia with the PyCall module, using Julia’s Array type. Because Python has very extensive, high-quality plotting functions with the matplotlib library, this is probably sufficient for 99% of Matlab users. Also, Python provides bindings to

    Links:

    https://github.com/stevengj/PyCall.jl

    http://matplotlib.org/gallery.html

    Numerous other possibilities to plot from Python are listed here:

    http://wiki.python.org/moin/NumericAndScientific/Plotting

  25. Piotr says:

    Hi,
    I’m using Matlab with
    feature accel on
    and fib with for loop:
    function n=fib_for(n)
      f=[1 1];
      if n > 2
        for i=1:n-2
          f_n=f(1)+f(2);
          f=[f(2) f_n];
        end
      end
      n=f(2);

    tic, fib_for(30), toc
    ans =
    832040
    Elapsed time is 0.000370 seconds.

  26. Piotr says:

    and using your function:
    tic, fib_nest(30), toc
    ans =
    832040
    Elapsed time is 10.050176 seconds.

    Anyway, nice post!

  27. justindomke says:

    It’s poor form to bring a dynamic programming algorithm to a naive benchmarking party!

  28. Piotr says:

    I’m just a naive user :) I’d never come up on my own with the fib function you’ve used, and just wanted to compare it with the ‘normal’ way I’m using Matlab.

  29. Piotr says:

    And now I’ve learned that I’m using dynamic programming :)
