Author Archives: Christopher Rackauckas

Julia iFEM 2: Optimizing Stiffness Matrix Assembly

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/julia-ifem2/

This is the second post looking at building a finite element method solver in Julia. The first post was about mesh generation and language bindings. In this post we are going to focus on performance. We start with the command from the previous post:

node,elem = squaremesh([0 1 0 1],.01)

which generates an array elem where each row holds the reference indices to the 3 points which form a triangle (element). The actual locations of these points are in the array node, and so node(1) gives the points in the (x,y)-plane for the ith point. What the call is saying is that these are generated for the unit square with mesh-size .01, meaning we have 10201 triangles.

The approach to building the stiffness matrix for the Poisson equation is described here. The general idea is that span our vector space by a basis of hat functions phi_{i}, and the so the stiffness matrix is found by the inner product (integral) between these basis functions. This translates to solving for the area of the triangles where two hat functions overlap, which we can do exactly since we chose the basis to be sufficiently nice. However, since the hat functions are zero except in a small range, most of these inner products are zero, meaning the resulting stiffness matrix is sparse. Our goal is to produce this sparse matrix as efficiently as possible. Lets get to it!

Building the Julia Version: Local Stiffness

The first function we need is titled localStiffness, which evaluates the inner product to give the “stiffness for one triangle”. In MATLAB the code is of the form:

function [At,area] = localstiffness(p)
    At = zeros(3,3);
    B = [p(1,:)-p(3,:); p(2,:)-p(3,:)];
    G = [[1,0]',[0,1]',[-1,-1]'];
    area = 0.5*abs(det(B));
    for i = 1:3
        for j = 1:3
            At(i,j) = area*((BG(:,i))'*(BG(:,j)));
        end
    end
end

It takes in a vector p of three points, solves for the area at the reference triangle, and transforms the area appropriately to give the stiffness for the triangle defined by the points of p. I want to make a few points about this code. First of all, it employs a “trick” for solving for the dot product. That is, it uses the transposed vector times another vector. I quote trick because to a mathematician, this is simply the definition of the dot product, and so it only seems natural to use it like this in MATLAB. However, two things to point out. First of all, in Julia, such an operation does not return a scalar, but a one dimensional vector. This in some cases will give unexpected errors, so it should be avoided. Not only that, but it is also inefficient. In both MATLAB and Julia, matrices are stored column-wise, that is they are stored as a array of pointers where the pointers go to an array of columns. Thus to access a row matrix, both MATLAB and Julia would have to access the pointer and then go to the array at which it points (a size 1 array), and take the value there. Notice this is an extra step. Therefore it’s more efficient to keep the vectors column-wise. (In reality, the vectors here are so small that it won’t make a difference, but this justifies that the change we will do for the Julia code in a performance-sense).

Two other small changes were made to this code. For one, Julia throws an error at the definition of G since it cannot read the transpose calls inside of an array declaration. This is easily fixed by simply transposing after the creation instead. Lastly, we change the pre-allocation of At from zeros to Julia’s general constructor. This is slightly more performant since it will allocate the space without doing an initial re-write step, saving the time it would take to loop through and set each value to zero. The result is the following:

function localstiffness(p)
  At = Array{Float64}(3,3);
  B = [p[1,:]-p[3,:]; p[2,:]-p[3,:]];
  G = [1 0 ; 0 1 ; -1 -1]';
  area = 0.5*abs(det(B));
  for i = 1:3
    for j = 1:3
      At[i,j] = area*dot(BG[:,i],BG[:,j]); 
    end
  end
  return(At,area)
end

which is ever so slightly more efficient, but in reality the same and just tweaked for the quirks.

Building the Julia Version: Matrix Assembly

Now we need to loop through each triangle and sum up the inner product between each pair of basis functions over each triangle. The intuitive code is:

function assemblingstandard(node,elem)
  N=size(node,1); NT=size(elem,1);
  A=zeros(N,N);
  for t=1:NT
    At,=localstiffness(node[vec(elem[t,:]),:]);
    for i=1:3
      @simd for j=1:3
        @inbounds A[elem[t,i],elem[t,j]]=A[elem[t,i],elem[t,j]]+At[i,j];
      end
    end
  end
  return(A)
end

I discussed previously the use of macros to speed up code without changing its style. The main problem that had to addressed in porting the code to Julia was that Julia will only take a vector for array referencing when the colon operator is used. Therefore since elem[t,:] returns a 1×3 Matrix of indices for the points associated with triangle t, once again a 1×3 Matrix is not an array in Julia so it throws an error. This is easy to fix by wrapping it in the vec() command, which others have tested to be the fastest method for conversion, and will actually do some fanciness in the background in order to not have to copy the array. This means that the cost of using vec is so small that I will use it liberally as a fix in these cases. Notice that within the loop vec is not required to reference A. This is because the issue only occurs when the colon operator is present.

However, since At is usually zero, we can improve this a lot by instead generating vectors to build a sparse matrix. What we will instead do is save the (i,j) pairs where the value should be stored, and the value, and use the sparse command to reduce. The sparse command will automatically add together the values from repeated (i,j) indices, effectively performing the update we had before. This gives us the code:

function assemblingsparse(node,elem)
  N = size(node,1); NT = size(elem,1);
  i = Array{Int64}(NT,3,3); j = Array{Int64}(NT,3,3); s = zeros(NT,3,3);
  index = 0;
  for t = 1:NT
    At, = localstiffness(node[vec(elem[t,:]),:]);
    for ti = 1:3, tj = 1:3
        i[t,ti,tj] = elem[t,ti];
        j[t,ti,tj] = elem[t,tj];
        s[t,ti,tj] = At[ti,tj];
    end
  end
  return(sparse(vec(i), vec(j), vec(s)));
end

Notice here that vec() needed to be used to build the sparse matrix because the easiest way to index within the loop was to use a 3-dimensional array and then flatten it via vec(). Notice a neat Julia trick where you can define multiple for loops in one line: “for ti = 1:3, tj = 1:3” is two loops and works as you’d expect. With some extra work we can get rid of the loop over the triangles by performing the calculations from localstiffness vector-wise, which gives us the vectorized form:

function assembling(node,elem)
  N = size(node,1); NT = size(elem,1);
  ii = Vector{Int64}(9*NT); jj = Vector{Int64}(9*NT); sA = Vector{Float64}(9*NT);
  ve = Array{Float64}(NT,2,3)
  ve[:,:,3] = node[vec(elem[:,2]),:]-node[vec(elem[:,1]),:];
  ve[:,:,1] = node[vec(elem[:,3]),:]-node[vec(elem[:,2]),:];
  ve[:,:,2] = node[vec(elem[:,1]),:]-node[vec(elem[:,3]),:];
  area = 0.5*abs(-ve[:,1,3].*ve[:,2,2]+ve[:,2,3].*ve[:,1,2]);
  index = 0;
  for i = 1:3, j = 1:3
     @inbounds begin
     ii[index+1:index+NT] = elem[:,i];
     jj[index+1:index+NT] = elem[:,j];
     sA[index+1:index+NT] = sum(ve[:,:,i].* ve[:,:,j],2) ./(4*area); # Replacing dot(ve[:,:,i],ve[:,:,j],2)
     index = index + NT;
     end
   end
  return(sparse(ii,jj,sA));
end

Here I note that Julia’s dot product will not act on matrices, only vectors. In order to do the row-wise dot product like we would do in MATLAB, we can simply use .* and sum the results in each row.

Dispelling a Myth: Vectorized Julia Rocks!

The most common complaint about Julia that people tend to have is that, in many cases, the code which gets the most performance is the de-vectorized code. “But the vectorized code can be so beautiful! Why would I want to change that?”. This myth seems to come from the alpha days or really bad tests, but it doesn’t seem to die. Instead, what I wish to show here is that vectorized code is also faster in Julia. First we run some basic timings:

@time assemblingstandard(node,elem);
@time assemblingsparse(node,elem);
@time assembling(node,elem);
 
0.538606 seconds (2.52 M allocations: 924.214 MB, 12.79% gc time)
0.322044 seconds (2.52 M allocations: 137.714 MB, 21.53% gc time)
0.015182 seconds (775 allocations: 28.650 MB, 16.97% gc time)

These timings pretty stability show this pattern. All of the methods were tried with parallelization and simd options with either no speedup or it being detrimental given the problem size. What this shows is that, out of the intuitive forms for solving these equations, the vectorized form was by far the fastest. The reason is that this is a highly vectorizable problem, whereas I discussed before the limitations that can cause vectorization to lose to devectorization.

But how does this fare against MATLAB? The “same” code was run in MATLAB (this is from iFEM, an optimized library Professor Long Chen) which gives the results:

tic; assemblingstandard(node,elem); toc;
tic; assemblingsparse(node,elem); toc;
tic; assembling(node,elem); toc;
 
Elapsed time is 0.874919 seconds.
Elapsed time is 0.698763 seconds.
Elapsed time is 0.020453 seconds.

To ensure the time difference between the vectorized versions, we had both problems solve it in a loop:

@time for i = 1:1000
  assembling(node,elem);
end
 
 9.312876 seconds (821.25 k allocations: 27.980 GB, 21.32% gc time)

vs MATLAB:

   tic; 
   for i=1:1000 assembling(node,elem); 
   end
   toc;
 
Elapsed time is 19.221982 seconds.

Thus, in line with what this coder found with R, the vectorized code ran more than twice as fast in Julia.

The Julia Optimization Mentality

This was a fun little exercise because I had no idea how it would turn out. Quite frankly, when I started I thought MATLAB would slightly edge out Julia when running such vectorized code. However, Julia continues to impress me. The only major problem that I had this time around was finding out to use the vec() function to deal with “1-dimensional matrices”, but once that was found I was able to get Julia to be faster than MATLAB, even though I know much more about MATLAB and this package itself is quite well optimized.

I think I should end on why I find the Julia philosophy compelling for scientific computing. The idea is not that “you have to devectorize to get the best code”, though there are situations where doing so can dramatically increase your speed. The idea is that you don’t have to contort yourself to vectorization to make everything work. In MATLAB, R, and Python, you have to vectorize in order to make your code to ever finish. That is not the case in Julia. Here, just write the code that seems natural and it will do really well. In this case, vectorized code was natural, and as you could see we got something that was even faster. To do better in MATLAB, we would at this stage have to start writing C/MEX code. In Julia, we could expand out the loops, play with adding SIMD calls, caching, etc. directly in the Julia language.

For scientific computing where we just want code that’s good enough to work, you can see it’s easier to get there with Julia. If you need to optimize it to be part of a library, you can optimize and devectorize it within Julia without having to go to C (many times by just adding macros throughout your code). Will it be as fast as C? No, but many tests show that you’ll at least get within a factor of two so, for almost every case, you might as well just code it in Julia and move on. Each blog post I do I am getting more and more converted!

The post Julia iFEM 2: Optimizing Stiffness Matrix Assembly appeared first on Stochastic Lifestyle.

Optimizing .*: Details of Vectorization and Metaprogramming

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/an-in-depth-look-at-vectorization/

For many of us mathematicians, we were taught to use MATLAB, and we were taught to vectorize everything. I mean obviously if we have matrices A, B, and C and want to multiply element-wise (say to solve a reaction-equation at each point in space), then the optimal code is

A.*B.*C

No questions to ask, right? Actually, this code isn’t as optimized as you’d think. Lets dig deeper.

BLAS, Linpack, and MKL

The reason you are always told by “the lords of numerical math” to vectorize your code is because very smart programmers worked really hard on making basic things work well. Most of the “standardized” vectorized computations are calling subroutines from packages known as BLAS and LINPACK. To see what version your MATLAB is using, you can call

version -blas
version -linpack

So every time you call svd, A*B, etc., you are actually calling a highly-optimized state of the art C/Fortran/Assembly code mixture which does all sorts of things to make sure you don’t have cache misses, have the function multi-threaded (i.e. these functions will run parallel on your multi-core machine), and much more!

But A.*B doesn’t do a BLAS call! When you use A.*B in MATLAB, it uses its own C-code to loop through an perform this operation. This causes a few problems. First of all, if you are using co-processors/accelerators such as the Xeon Phi, the option of automatically offloading highly-vectorized problems to a GPU-like device only offloads BLAS/LINPACK calls. Therefore A*B gets accelerated whereas A.*B does not. Another thing is that, as far as I can tell, A.*B does not have all of the processor-specific optimizations to make this operation super fast. This is huge since processor technologies like AVX2 allows specific vector operations to do calls on 8 numbers at a time, making highly-parallel math operations like this get close to an 8x speedup if used correctly (and AVX3 is coming out soon to double that!).

So the simple A.*B is good for prototyping, but from benchmarking many SPDE solvers I realized this was holding me back. How can we do better? Well, we can replace .* with vector operations from MKL by directly calling MKL VML functions. MATLAB has a page on how to use the max interface to call BLAS/LINPACK functions and using this with the v?mul (i.e. vdmul for 64-bit floats). Thus to do this in MATLAB you have to write some C code. Fun!

Number of Operations ~ Number of Calls

However, this STILL isn’t optimal! Too see this, let us write out what MATLAB’s interpreter turns this into:

times(times(A,B),C)

Thus the behind the scenes looks something like this:

for i = 1:N
  for j = 1:M
     temp = A[i,j] * B[i,j];
  end
end
 
%Some function call overhead to return temp, save it to temp2
 
for i = 1:N
  for j = 1:M
     temp3 = temp2[i,j] * C[i,j];
  end
end
 
%Return temp3

Instead of one NxM loop, we did two NxM loops, effectively doubling the amount of computations required! Now think about solving reactions like:

A.*B.*C + D.*E*gamma + Diffusion*A;

you see that simple vectorization is actually far from optimal. Not only did we have to do extra loops, but we also had to make a bunch of temporary variables and garbage collect. What we really want to do is something like:

for i = 1:N
  for j = 1:M
     temp = A[i,j] * B[i,j] *C[i,j];
  end
end

How do you do this in MATLAB? Once again, the answer is to write C-code via the MEX interface. You can do this if you want.

Julia’s Approach

Lets look at if Julia does any better. Since Julia is opensource, we can actually look directly at the source code. Since it’s written in Julia and hosted on github it’s easy to go through the details. In the Julia REPL, we can find out what function is actually called by looking at:

julia> methods(.*)
# 26 methods for generic function ".*":
.*(x::Real, r::OrdinalRange{T,S}) at range.jl:574
.*(x::Real, r::FloatRange{T<:AbstractFloat}) at range.jl:575
.*(x::Real, r::LinSpace{T<:AbstractFloat}) at range.jl:576
.*(r::FloatRange{T<:AbstractFloat}, x::Real) at range.jl:578
.*(r::LinSpace{T<:AbstractFloat}, x::Real) at range.jl:579
.*(r::Range{T}, x::Real) at range.jl:577
.*(x::Number, r::Range{T}) at range.jl:646
.*(r::Range{T}, y::Number) at range.jl:647
.*(x::Number, y::Number) at operators.jl:115
.*(x::Bool, B::BitArray{N}) at bitarray.jl:1033
.*(x::Number, B::BitArray{N}) at bitarray.jl:1035
.*(A::Number, B::SparseMatrixCSC{Tv,Ti<:Integer}) at sparse/sparsematrix.jl:1028
.*{P<:Base.Dates.Period}(y::Real, X::Union{DenseArray{P<:Base.Dates.Period,N},SubArray{P<:Base.Dates.Period,N,A<:DenseArray{T,N},I<:Tuple{Vararg{Union{Colon,Int64,Range{Int64}}}},LD}}) at dates/periods.jl:56
.*{T}(A::Number, B::AbstractArray{T,N}) at arraymath.jl:118
.*(B::BitArray{N}, x::Bool) at bitarray.jl:1034
.*(B::BitArray{N}, x::Number) at bitarray.jl:1036
.*(A::SparseMatrixCSC{Tv,Ti<:Integer}, B::Number) at sparse/sparsematrix.jl:1027
.*{P<:Base.Dates.Period}(X::Union{DenseArray{P<:Base.Dates.Period,N},SubArray{P<:Base.Dates.Period,N,A<:DenseArray{T,N},I<:Tuple{Vararg{Union{Colon,Int64,Range{Int64}}}},LD}}, y::Real) at dates/periods.jl:66
.*{T}(A::AbstractArray{T,N}, B::Number) at arraymath.jl:125
.*(x::Number, J::UniformScaling{T<:Number}) at linalg/uniformscaling.jl:143
.*(J::UniformScaling{T<:Number}, x::Number) at linalg/uniformscaling.jl:144
.*(A::BitArray{N}, B::BitArray{N}) at broadcast.jl:475
.*(A::AbstractArray{Bool,N}, B::BitArray{N}) at broadcast.jl:475
.*(A::BitArray{N}, B::AbstractArray{Bool,N}) at broadcast.jl:475
.*(A::AbstractArray{T,N}, B::AbstractArray{T,N}) at sparse/sparsematrix.jl:1026
.*(As::AbstractArray{T,N}...) at broadcast.jl:307

What this is saying is that there are 26 possible function calls, each depending on the type that you put in there. For example, if you do .5.*(1:100), 1:100 is not an array but a special Julia type known as a range, and so .* calls the function .*(x::Number, r::Range{T}) which can be found in range.jl at line 647. For A.*B, we are want

.*(A::AbstractArray{T,N}, B::AbstractArray{T,N}) at sparse/sparsematrix.jl:1026

Surprisingly, it’s actually defined in the sparematrix.jl file. Find that file on GitHub and we see that it is defined as:

 
(.*)(A::AbstractArray, B::AbstractArray) = broadcast_zpreserving(MulFun(), A, B)
...
broadcast_zpreserving!(args...) = broadcast!(args...)

MulFun() is the what Julia translates * into, and so this means that when Julia sees A.*B, it calls broadcast on the * function. In the Julia documentation we see that broadcast simply calls the function on all of the array members. This is what we’d expect it to do, but is it better than if we were to write a loop? We check broadcast.jl and find:

        function broadcast!(f, B::$Bsig, As::$Asig...)
            nd = ndims(B)
            narrays = length(As)
 
            cache_f    = @get! cache      f       Dict{Int,Dict{Int,Any}}()
            cache_f_na = @get! cache_f    narrays Dict{Int,Any}()
            func       = @get! cache_f_na nd      $gbf($gbb, nd, narrays, f)
 
            func(B, As...)
            B
        end

which shows that it speeds up the operation by storing the function in cache. So moral of the story, A.*B uses broadcast which will beat your simple loop because of cache-control.

Can We Do Better?

Now just like MATLAB’s .* we have to ask, can we do better? Well, the first things we can do is use MKL VML bindings in Julia. With that packages you can plug it in and call the functions with ease (once vdmul gets added…). We can even over-write .* so that VML is used by default. However, we still have the problem we noted with MATLAB that the most efficient way would be to de-vectorize and write a loop which does multiple operations at once.

If you aren’t familiar with metaprogramming, it is simply using programs to write programs. MATLAB’s inability to metaprogram means we needed to go to C to write a loop, but that loop only works for exactly the type of inputs we had. It also causes problems since MATLAB’s anonymous functions are notoriously slow and so if you write a function that makes a function and you make it into an anonymous function (i.e. @(x) …), then you will have at least a 2x slowdown in your code over written the function yourself. Thus in MATLAB’s inability to metaprogram means you have to write a lot of code.

This is where Julia being a newer langugage comes to the rescue. It is chalk-full of metaprogramming abilities. What we are going to use here are macros. In a previous post I used macros such as @simd without explaining what it did. Well, a macro is simply a type of metaprogramming function that re-writes the code associated with it. It’s best explained by an example. There is a de-vectorization module for Julia which contains the @devec macro. What is does is take a vectorized expression like:

r = a .* b + c .* d + a

and re-writes it as a de-vectorized loop:

n = length(a)
r = zeros(n)
for i = 1 : n
    r[i] = a[i] * b[i] + c[i] * d[i] + a[i]
end

In MATLAB we can’t use loops for high-performance code because they are not optimized, so we had to take our equations, write C-code, compile it, make sure it all works, get rid of segmentation faults… it wasn’t fun. In Julia, loops are optimized so there’s no reason to go to C to write the loop, and here the macro @devec does the same thing. Therefore, to make the optimized loop in Julia, we write

@devec r = a .* b + c .* d + a

Thus the difference between your “high-performance code” and your prototype code is @devec. Welcome to the future folks.

Going all the way: Optimizing the loop

That’s not necessarily the end. Remember that BLAS/MKL/VML uses technologies like SIMD/AVX in order to do multiple operations per computation? Intel has written a Julia macro which will take loops and vectorize them in this manner by adding the macro @simd before the loop. MKL also speeds up computations by lowering the precision in the math which can be done with the @fastmath macro. Lastly, we can to turn off array out of bounds checking to make things faster as well. Thus what we really want is @devec to write the code:

n = length(a)
r = zeros(n)
@simd for i = 1 : n
    @fastmath @nobounds r[i] = a[i] * b[i] + c[i] * d[i] + a[i]
end

By adding @fastmath @nobounds before @devec we can get that part changed. What about adding @simd? Macros can take in arguments, and the developer for @devec is looking to add a context argument so you can tell it to expand with @simd, @parallel (for multi-cpu), or even with Cuda (i.e. run this on GPU in parallel), OpenCL and others. Multi-threaded will probably be added as well when it’s added to Julia’s base library. If you want it done faster, just contribute to the github repository!.

I do have to mention a caveat. There are codes where @devec will not work. For example, sqrt(exp(A*B)) where A and B are matrices. Even though it’s simple to see how we’d want to vectorize it, writing something that will go “okay, make A*B be a BLAS call, then call sqrt(exp()) on that” is hard to do generally. The creator of @devec is looking into using Julia’s generated functions to perform this kind of task. However, you can always do this by writing the loop out yourself. Or another vectorized implementation would be to use the broadcast function. Both are much easier than going to C and still do quite well performance-wise. (On this note, there is something like broadcast in MATLAB and these are the cellfun, structfun, arrayfun methods. However, they are so slow (except on GPUs where they are the fastest…) that they should never be used).

Moral of the Story

I hope I showed you that there is much more to making .* achieve its best performance than you originally thought. Of course, in MATLAB, using .* and other vectorized operations is almost always better than writing a loop. Thus in every class you’re told to ALWAYS use vectorized MATLAB functions because it’s best. In reality, that’s not the case. If you want performance, you need especially written loops in a high-performance language and you have to implement cache-handling, AVX/SIMD, multi-threading, etc. In MATLAB, this means writing C-codes and calling it via the MEX interface. This needs to be done for every different setup you encounter. However, in Julia, the @devec macro achieves the same end (almost, and it keeps getting better!). That’s the power of metaprogramming and one of the many reasons why I am trying to use more Julia.

The post Optimizing .*: Details of Vectorization and Metaprogramming appeared first on Stochastic Lifestyle.

Optimizing Julia for Performance: A Practical Example

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/optimizing-julia-performance-practical-example/

Let’s take a program which plots the standard logistic map:

r = 2.9:.00005:4;
numAttract = 100;
steady = ones(length(r),1)*.25;
for i=1:300 ## Get to steady state
  steady = r.*steady.*(1-steady);
end
x = zeros(length(steady),numAttract);
x[:,1] = steady;
for i=2:numAttract ## Now grab some values at the attractor
  x[:,i] = r.*x[:,i-1].*(1-x[:,i-1]);
end
using PyPlot;
fig = figure(figsize=(20,10));
plot(collect(r),x,"b.",markersize=.06)
savefig("plot.png",dpi=300);

This plots the logistic map. If you take the same code and change the array dereferencing from A[i,j] to A(i,j) you get MATLAB code. However, I couldn’t get MATLAB to have it looks as good, but that’s beside the point.

MATLAB Plot
Julia Plot

What I want to show are the steps for optimizing this code. In MATLAB, this is as vectorized as it gets and so you’re pretty much done. In Julia, there are a few steps that you can do from here. The first thing to notice that if this code is run directly, then all the variables are global scope which hurts performance, so the first thing we do is change this code to a function call. Next we note that we can improve performance by AVX2 which allows the processor to take in multiple numbers into the same input. Think about it as operating on vectors of 8 numbers at a time, and when AVX3 comes out, you will be able to do 16 numbers at a time. This is something you’ll want to take advantage of! Vectorized operations already have this implemented, so to implement it in our loops we simply add @simd before our for loops. Be careful when using simd: it will allows the loops to re-order the computation. Since each simulation (different r’s) is independent, we are fine. More about this can be found here.

This gives the following code:

function logisticPlot()
  tic()
  r = 2.9:.00005:4;
  numAttract = 100;
  steady = ones(length(r),1)*.25;
  @simd for i=1:400 ## Get to steady state
    @inbounds @fastmath steady = r.*steady.*(1-steady);
  end
  x = zeros(length(steady),numAttract);
  x[:,1] = steady;
  @simd for i=2:numAttract ## Now grab some values at the attractor
    @inbounds @fastmath x[:,i] = r.*x[:,i-1].*(1-x[:,i-1]);
  end
  toc()
  fig = figure(figsize=(20,10));
  plot(collect(r),x,"b.",markersize=.06)
  savefig("plot.png",dpi=300);
end
using PyPlot
@time logisticPlot()

which does better than before. Here we can’t. Normally we can do even by explicitly defining the types which allows the LLVM to know the data types and use compiler optimizations like it was C code.

This is pretty much it because we cannot do @parallel because we need to ensure we are stepping in time without skipping steps (@parallel will re-order the timesteps). So, all in all, how well did we do? The final code takes .15 seconds on my machine to run (without the plotting). However, MATLAB is at .05. Why did MATLAB win? If you check the timings in more detail you’ll see that it all comes down to the inner loop steps still taking longer than MATLAB, and there is a clear reason for it: Julia does not have multithreading yet. If you check

A = rand(20000,20000);
B = rand(20000,20000);
A.*B

you will see that this uses all of the cores on your machine, whereas the same exact code in Julia does not (if you do A*B, Julia will because it sends that call to the Fortran program BLAS which is multi-threaded). So there it is, MATLAB wins because it’s using all 16 cores on my workstation, but Julia on 1 core holds up to it and makes a prettier plot. With this, I think Julia will win out in this example once it’s multi-threaded, but we’ll see.

What I really wanted to highlight here is that Julia is flexible enough to make both the prototype and the final code. You can code using simple syntax like in MATLAB to get things done, and then when you have a working product, start adding macros, parallelization, and type specifications to get something that both complies and runs like C. In MATLAB, you write a MATLAB code and when you get that working, you re-write all the heavy-lifting parts in C and link it up via MEX. This is quite a pain to do, and this itself is why I am investing time in Julia, knowing it will payout well.

Note: Julia’s threading branch has merged to the main branch already. They are just waiting for it to stabilize a bit to then put it in a standard release (with the major multi-threaded operations such as .* utilizing it). So stay tuned.

The post Optimizing Julia for Performance: A Practical Example appeared first on Stochastic Lifestyle.