Tag Archives: performance

Using Julia’s Type System For Hidden Performance Gains

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/using-julias-type-system-hidden-performance-chunkedarrays-growablearrays-ellipsisnotation/

What I want to share today is how you can use Julia’s type system to hide performance gains in your code. What I mean is this: in many cases you may find out that the optimal way to do some calculation is not a “clean” solution. What do you do? What I want to do is show how you can define special arrays which are wrappers such that these special “speedups” are performed in the background, while having not having to keep all of that muck in your main algorithms. This is easiest to show by example.

The examples I will be building towards are useful for solving ODEs and SDEs. Indeed, these tricks have all been implemented as part of DifferentialEquations.jl and so these examples come from a real use case! They really highlight a main feature of Julia: since Julia code is fast (as long as you have type stability!), you don’t need to worry about writing code outside of Julia, and you can take advantage of higher-level features (with caution). In Python, normally one would only get speed benefits by going down to C, and so utilizing these complex objects would not get speed benefits over simply using numpy arrays in the most vectorized fashion. The same holds for R. In MATLAB… it would be really tough to implement most of this in MATLAB!

Taking Random Numbers in Chunks: ChunkedArrays

Let’s say we need to take random numbers in a loop like the following:

for i = 1:N
  dW = randn(size(u))
  #Do some things and add dW to u
end

While this is the “intuitive” code to write, it’s not necessarily the best. While there have been some improvements made since early Julia, in principle it’s just slower to make 1000 random numbers via randn() than to use randn(1000). This is because of internal speedups due to caching, SIMD, etc. and you can find mentions of this fact all over the web especially when people are talking about fast random number generators like from the VSL library.

So okay, what we really want to do is the following. Every “bufferSize” steps, create a new random number dW which is of size size(u)*bufferSize, and go through using the buffer until it is all used up, and then grab another buffer.

for i = 1:N
  if i%bufferSize == 0
    dW = randn(size(u),bufferSize)
  end
  #Do some things and add dW[..,i] to u
end

But wait? What if we don’t always use one random number? Sometimes the algorithm may need to use more than one! So you can make an integer k which tracks the current state in the buffer, and then at each point where it can be incremented, you add the conditional to grab a new buffer, etc. Also, what if you want to have the buffer generated in parallel? As you can see, code complexity explosion, just to go a little faster?

This is where ChunkedArrays come in. What I did is defined an array which essentially does the chunking/buffering in the background, so that way the code in the algorithm could be clean. A ChunkedArray is a wrapper over an array, and then used the next command to hide all of this complexity. Thus, to generate random numbers in chunks to get this speed improvement, you can use code like this:

rands = ChunkedArray(u)
for i = 1:N
  if i%bufferSize == 0
    dW = next(rands)
  end
  #Do some things and add dW[..,i] to u
end

Any time another random number is needed, you just call next. It internally stores an array and the state of the buffer, and the next function automatically check / replenishes the buffer, and can launch another process to do this in parallel if the user wants. Thus we get the optimal solution without sacrificing cleanliness. I chopped off about 10% of a runtime in Euler-Maruyama code in DifferentialEquations.jl by switching to ChunkedArrays, and haven’t thought about doing a benchmark since.

Safe Vectors of Arrays and Conversion: GrowableArrays

First let’s look at the performance difference between Vectors of Arrays and higher-dimensional contiguous arrays when using them in a loop. Julia’s arrays can take in a parametric type which makes the array hold arrays, this makes the array essentially an array of pointers. The issue here is that this adds an extra cost every time the array is dereferenced. However, for high-dimensional arrays, the : way of referencing has to generate a slice each time. Which way is more performant?

function test1()
  u = Array{Int}(4,4,3)
  u[:,:,1] = [1 2 3 4
          1 3 3 4
          1 5 6 3
          5 2 3 1]
 
  u[:,:,2] = [1 2 3 4
          1 3 3 4
          1 5 6 3
          5 2 3 1]
 
  u[:,:,3] = [1 2 3 4
          1 3 3 4
          1 5 6 3
          5 2 3 1]
 
  j = 1
  for i = 1:100000
    j += sum(u[:,:,1] + u[:,:,2] + 3u[:,:,3] + u[:,:,i%3+1] -  u[:,:,(i%j)%2+1])
  end
end
 
 
 
function test2()
  u = Vector{Matrix{Int}}(3)
  u[1] = [1 2 3 4
          1 3 3 4
          1 5 6 3
          5 2 3 1]
 
  u[2] = [1 2 3 4
          1 3 3 4
          1 5 6 3
          5 2 3 1]
 
  u[3] = [1 2 3 4
          1 3 3 4
          1 5 6 3
          5 2 3 1]
 
  j = 1
  for i = 1:100000
    j += sum(u[1] + u[2] + 3u[3] + u[i%3+1] - u[(i%j)%2+1])
  end
end
 
 
function test3()
  u = Array{Int}(4,4,4)
  u[1,:,:] = reshape([1 2 3 4
          1 3 3 4
          1 5 6 3
          5 2 3 1],(1,4,4))
 
  u[2,:,:] = reshape([1 2 3 4
          1 3 3 4
          1 5 6 3
          5 2 3 1],(1,4,4))
 
  u[3,:,:] = reshape([1 2 3 4
          1 3 3 4
          1 5 6 3
          5 2 3 1],(1,4,4))
 
  j = 1
  for i = 1:100000
    j += sum(u[1,:,:] + u[2,:,:] + 3u[3,:,:] + u[i%3+1,:,:] - u[(i%j)%2+1,:,:])
  end
end
 
#Pre-compile
test1()
test2()
test3()
 
t1 = @elapsed for i=1:10 test1() end
t2 = @elapsed for i=1:10 test2() end
t3 = @elapsed for i=1:10 test3() end
 
println("Test results: t1=$t1, t2=$t2, t3=$t3")
#Test results: t1=1.239379946, t2=0.576053075, t3=1.533462129

So using Vectors of Arrays is fast for dereferecing.

Now think about adding to an array. If you have a Vector of pointers and need to resize the array, it’s much easier to resize and copy over some pointers then it is to copy over all of the arrays. So, if you’re going to grow an array in a loop, the Vector of Arrays is the fastest implementation! Here’s a quick benchmark from GrowableArrays.jl:

using GrowableArrays, EllipsisNotation
using Base.Test
 
tic()
const NUM_RUNS = 100
const PROBLEM_SIZE = 1000
function test1()
  u =    [1 2 3 4
          1 3 3 4
          1 5 6 3
          5 2 3 1]
 
  uFull = u
  for i = 1:PROBLEM_SIZE
    uFull = hcat(uFull,u)
  end
  uFull
end
 
function test2()
  u =    [1 2 3 4
          1 3 3 4
          1 5 6 3
          5 2 3 1]
 
  uFull = u
 
  for i = 1:PROBLEM_SIZE
    uFull = vcat(uFull,u)
  end
  uFull
end
 
function test3()
  u =    [1 2 3 4
          1 3 3 4
          1 5 6 3
          5 2 3 1]
 
  uFull = Vector{Int}(0)
  sizehint!(uFull,PROBLEM_SIZE*16)
  append!(uFull,vec(u))
 
  for i = 1:PROBLEM_SIZE
    append!(uFull,vec(u))
  end
  reshape(uFull,4,4,PROBLEM_SIZE+1)
  uFull
end
 
function test4()
  u =    [1 2 3 4
          1 3 3 4
          1 5 6 3
          5 2 3 1]
 
  uFull = Vector{Array{Int}}(0)
  push!(uFull,copy(u))
 
  for i = 1:PROBLEM_SIZE
    push!(uFull,copy(u))
  end
  uFull
end
 
function test5()
  u =    [1 2 3 4
          1 3 3 4
          1 5 6 3
          5 2 3 1]
 
  uFull = Vector{Array{Int,2}}(0)
  push!(uFull,copy(u))
 
  for i = 1:PROBLEM_SIZE
    push!(uFull,copy(u))
  end
  uFull
end
 
function test6()
  u =    [1 2 3 4
          1 3 3 4
          1 5 6 3
          5 2 3 1]
 
  uFull = Vector{typeof(u)}(0)
  push!(uFull,u)
 
  for i = 1:PROBLEM_SIZE
    push!(uFull,copy(u))
  end
  uFull
end
 
function test7()
  u =    [1 2 3 4
          1 3 3 4
          1 5 6 3
          5 2 3 1]
 
  uFull = GrowableArray(u)
  for i = 1:PROBLEM_SIZE
    push!(uFull,u)
  end
  uFull
end
 
function test8()
  u =    [1 2 3 4
          1 3 3 4
          1 5 6 3
          5 2 3 1]
 
  uFull = GrowableArray(u)
  sizehint!(uFull,PROBLEM_SIZE)
  for i = 1:PROBLEM_SIZE
    push!(uFull,u)
  end
  uFull
end
 
println("Run Benchmarks")
println("Pre-Compile")
#Compile Test Functions
test1()
test2()
test3()
test4()
test5()
test6()
test7()
test8()
 
println("Running Benchmarks")
t1 = @elapsed for i=1:NUM_RUNS test1() end
t2 = @elapsed for i=1:NUM_RUNS test2() end
t3 = @elapsed for i=1:NUM_RUNS test3() end
t4 = @elapsed for i=1:NUM_RUNS test4() end
t5 = @elapsed for i=1:NUM_RUNS test5() end
t6 = @elapsed for i=1:NUM_RUNS test6() end
t7 = @elapsed for i=1:NUM_RUNS test7() end
t8 = @elapsed for i=1:NUM_RUNS test8() end
 
println("Benchmark results: $t1 $t2 $t3 $t4 $t5 $t6 $t7 $t8")
 
#Benchmark results: 1.923640854 2.131108443 0.012493308 0.00866045 0.005246504 0.00532613 0.00773568 0.00819909

As you can see in test7 and test8, I created a “GrowableArray” which is an array which acts like a Vector of Arrays. However, it has an added functionality that if you copy(G), then what you get is the contiguous array. Therefore in the loop you can grow the array the quickest way as a storage machine, but after the loop (say to plot the array), but at any time you can copy it to a contiguous array which is more suited for interop with C and other goodies.

It also hides a few painful things. Notice that in the code we pushed a copy of u (copy(u)). This is because when u is an array, it’s only the reference to the array, so if we simply push!(uFull,u), every element of uFull is actually the same item! This benchmark won’t catch this issue, but try changing u and you will see that every element of uFull changes if you don’t use copy. This can be a nasty bug, so instead we build copy() into the push!() command for the GrowableArray. This gives another issue. Since copying a GrowableArray changes it, you need to make sure push! doesn’t copy on arguments of GrowableArrays (to create GrowableArrays of GrowableArrays). However, this is easily managed via dispatch.

Helping Yourself and the Julia Community

Simple projects like these lead to re-usable solutions to improve performance while allowing for ease of use. I have just detailed some projects I have personally done (and have more to do!), but there are others that should be pointed out. I am fond of projects like VML.jl which speedup standard functions, and DoubleDouble.jl which implements efficient quad-precision numbers that you can then use in place of other number types.

I think Julia will succeed not by the “killer packages” that are built in Julia, but by a rich type ecosystem that will make everyone want to build their “killer package” in Julia.

The post Using Julia’s Type System For Hidden Performance Gains appeared first on Stochastic Lifestyle.

Optimizing Julia for Performance: A Practical Example

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/optimizing-julia-performance-practical-example/

Let’s take a program which plots the standard logistic map:

r = 2.9:.00005:4;
numAttract = 100;
steady = ones(length(r),1)*.25;
for i=1:300 ## Get to steady state
  steady = r.*steady.*(1-steady);
end
x = zeros(length(steady),numAttract);
x[:,1] = steady;
for i=2:numAttract ## Now grab some values at the attractor
  x[:,i] = r.*x[:,i-1].*(1-x[:,i-1]);
end
using PyPlot;
fig = figure(figsize=(20,10));
plot(collect(r),x,"b.",markersize=.06)
savefig("plot.png",dpi=300);

This plots the logistic map. If you take the same code and change the array dereferencing from A[i,j] to A(i,j) you get MATLAB code. However, I couldn’t get MATLAB to have it looks as good, but that’s beside the point.

MATLAB Plot
Julia Plot

What I want to show are the steps for optimizing this code. In MATLAB, this is as vectorized as it gets and so you’re pretty much done. In Julia, there are a few steps that you can do from here. The first thing to notice that if this code is run directly, then all the variables are global scope which hurts performance, so the first thing we do is change this code to a function call. Next we note that we can improve performance by AVX2 which allows the processor to take in multiple numbers into the same input. Think about it as operating on vectors of 8 numbers at a time, and when AVX3 comes out, you will be able to do 16 numbers at a time. This is something you’ll want to take advantage of! Vectorized operations already have this implemented, so to implement it in our loops we simply add @simd before our for loops. Be careful when using simd: it will allows the loops to re-order the computation. Since each simulation (different r’s) is independent, we are fine. More about this can be found here.

This gives the following code:

function logisticPlot()
  tic()
  r = 2.9:.00005:4;
  numAttract = 100;
  steady = ones(length(r),1)*.25;
  @simd for i=1:400 ## Get to steady state
    @inbounds @fastmath steady = r.*steady.*(1-steady);
  end
  x = zeros(length(steady),numAttract);
  x[:,1] = steady;
  @simd for i=2:numAttract ## Now grab some values at the attractor
    @inbounds @fastmath x[:,i] = r.*x[:,i-1].*(1-x[:,i-1]);
  end
  toc()
  fig = figure(figsize=(20,10));
  plot(collect(r),x,"b.",markersize=.06)
  savefig("plot.png",dpi=300);
end
using PyPlot
@time logisticPlot()

which does better than before. Here we can’t. Normally we can do even by explicitly defining the types which allows the LLVM to know the data types and use compiler optimizations like it was C code.

This is pretty much it because we cannot do @parallel because we need to ensure we are stepping in time without skipping steps (@parallel will re-order the timesteps). So, all in all, how well did we do? The final code takes .15 seconds on my machine to run (without the plotting). However, MATLAB is at .05. Why did MATLAB win? If you check the timings in more detail you’ll see that it all comes down to the inner loop steps still taking longer than MATLAB, and there is a clear reason for it: Julia does not have multithreading yet. If you check

A = rand(20000,20000);
B = rand(20000,20000);
A.*B

you will see that this uses all of the cores on your machine, whereas the same exact code in Julia does not (if you do A*B, Julia will because it sends that call to the Fortran program BLAS which is multi-threaded). So there it is, MATLAB wins because it’s using all 16 cores on my workstation, but Julia on 1 core holds up to it and makes a prettier plot. With this, I think Julia will win out in this example once it’s multi-threaded, but we’ll see.

What I really wanted to highlight here is that Julia is flexible enough to make both the prototype and the final code. You can code using simple syntax like in MATLAB to get things done, and then when you have a working product, start adding macros, parallelization, and type specifications to get something that both complies and runs like C. In MATLAB, you write a MATLAB code and when you get that working, you re-write all the heavy-lifting parts in C and link it up via MEX. This is quite a pain to do, and this itself is why I am investing time in Julia, knowing it will payout well.

Note: Julia’s threading branch has merged to the main branch already. They are just waiting for it to stabilize a bit to then put it in a standard release (with the major multi-threaded operations such as .* utilizing it). So stay tuned.

The post Optimizing Julia for Performance: A Practical Example appeared first on Stochastic Lifestyle.