By: pkofod
Re-posted from: http://www.pkofod.com/2017/04/24/timing-in-julia/
Timing code is important when you want to benchmark or profile your code. Is it the solution of a linear system or the Monte Carlo integration scheme that takes up most of the time? Is version A or version B of a function faster? Questions like that show up all the time. Let us have a look at a few of the possible ways of timing things in Julia.
The basics
The most basic timing functionalities in Julia are the ones included in the Base language. The standard way of timing things in Julia, is by use of the @time
macro.
julia> function test(n) A = rand(n, n) b = rand(n) @time A\b end test (generic function with 1 method)
Do note, that the code we want to time is put in a function . This is because everything we do at the top level in the REPL is in global scope. It’s a mistake a lot of people do all the time, but currently it is a very bad idea performance wise. Anyway, let’s see what happens for n = 1, 10, 100, and 1000.
julia> test(1); 0.000002 seconds (7 allocations: 320 bytes) julia> test(10); 0.000057 seconds (9 allocations: 1.313 KB) julia> test(100); 0.001425 seconds (10 allocations: 80.078 KB) julia> test(1000); 0.033573 seconds (10 allocations: 7.645 MB, 27.81% gc time) julia> test(1000); 0.045214 seconds (10 allocations: 7.645 MB, 47.66% gc time)
The first run is to compile both test
, and then we have a look at what happens when the dimension of our problem increases. Elapsed time seems to increase, and we also see that the number of allocations, and the amount of memory that was allocated increases. For the runs with dimension 1000 we see something else in the output. 30-50% of the time was spent in “gc”. What is this? Julia is a garbage collected language. This means that Julia keeps track of current allocations, and frees the memory if it isn’t needed anymore. It doesn’t do this all the time, though. Running the 1000-dimensional problem once more gives us
julia> test(1000) 0.029277 seconds (10 allocations: 7.645 MB)
We see it runs slightly faster, and there is no GC time this time around. Of course, these things will look slightly different if you try to replicate them.
So now we can time. But what if we want to store this number? We could be tempted to try
t = @time 3+3
but we will realize, that what is returned is the return value of the expression, not the elapsed time. To save the time, we can either use @timed
or @elapsed
. Let us try to change the @time
to @timed
and look at the output when we have our new test2
function return the return value.
julia> function test2(n) A = rand(n, n) b = rand(n) @timed A\b end test2 (generic function with 1 method) julia> test2(3) ([0.700921,-0.120765,0.683945],2.7889e-5,512,0.0,Base.GC_Diff(512,0,0,9,0,0,0,0,0))
We see that it returns a tuple with: the return value of A\b
followed by the elapsed time, then the bytes allocated, time spent in garbage collection, and lastly some further memory counters. This is great as we can now work with the information @time
printed, but we still have access to the results of our calculations. Of course, it is a bit involved to do it this way. If we simply wanted to see the elapsed time to act on that – then we would just use @time
as we did above.
Before we move on to some simpler macros, let us consider the last “time*-family” macro: @timev
. As we saw above, @timed
contained more information about memory allocation than @time
printed. If we want the “verbose” version, we use @timev
(v for verbose):
julia> function test3(n) A = rand(n, n) b = rand(n) @timev A\b end test3 (generic function with 1 method)
Running test3
on a kinda large problem, we see that is does indeed print the contents of Base.GC_Diff
julia> test3(5000); 1.923164 seconds (12 allocations: 190.812 MB, 4.67% gc time) elapsed time (ns): 1923164359 gc time (ns): 89733440 bytes allocated: 200080368 pool allocs: 9 malloc() calls: 3 GC pauses: 1 full collections: 1
If any of the entries are zero, the corresponding lines are omitted.
julia> test3(50); 0.001803 seconds (10 allocations: 20.828 KB) elapsed time (ns): 1802811 bytes allocated: 21328 pool allocs: 9 malloc() calls: 1
Of the three macros, you’ll probably not use @timev
a lot.
Simpler versions
If we only want the elapsed time or only want the allocations, then we used either the @elapsed
or @allocated
macros. However, these do not return the results of our calculations, so in many cases it may be easier to just used @timed
, so we can grab the results, the elapsed time, and the allocation information. “MATLAB”-style tic();toc()
‘s are also available. toc()
prints the time, while toq()
is used if we want only the returned time without the printing. It is also possible to use time_ns()
to do what time.time()
would do in Python, although for practically all purposes, the above macros are recommended.
More advanced functionality
Moving on to more advanced features, we venture into the package ecosystem.
Nested timings
The first package I will present is the nifty TimerOutputs.jl
by Kristoffer Carlsson. This packages essentially allows you to nest @time
calls. The simplest way to show how it works, is to use the example posted at the announcement (so credit to Kristoffer for the example).
using TimerOutputs # Create the timer object to = TimerOutput() # Time something with an assigned label @timeit to "sleep" sleep(0.3) # Data is accumulated for multiple calls for i in 1:100 @timeit to "loop" 1+1 end # Nested sections are possible @timeit to "nest 1" begin @timeit to "nest 2" begin @timeit to "nest 3.1" rand(10^3) @timeit to "nest 3.2" rand(10^4) @timeit to "nest 3.3" rand(10^5) end rand(10^6) end
Basically we’re timing the sleep call in one time counter, all the additions in the loop in another counter, and then we do some nested generation of random numbers. Displaying the to
instance gives us something like the following
─────────────────────────────────────────────────────────────────────── Time Allocations ────────────────────── ─────────────────────── Tot / % measured: 6.48s / 5.60% 77.4MiB / 12.0% Section ncalls time %tot avg alloc %tot avg ─────────────────────────────────────────────────────────────────────── sleep 1 338ms 93.2% 338ms 804KiB 8.43% 804KiB nest 1 1 24.7ms 6.80% 24.7ms 8.52MiB 91.5% 8.52MiB nest 2 1 9.10ms 2.51% 9.10ms 899KiB 9.43% 899KiB nest 3.1 1 3.27ms 0.90% 3.27ms 8.67KiB 0.09% 8.67KiB nest 3.3 1 3.05ms 0.84% 3.05ms 796KiB 8.34% 796KiB nest 3.2 1 2.68ms 0.74% 2.68ms 92.4KiB 0.97% 92.4KiB loop 100 6.97μs 0.00% 69.7ns 6.08KiB 0.06% 62B ───────────────────────────────────────────────────────────────────────
which nicely summarizes absolute and relative time and memory allocation of the individual @timeit
calls. A real use case could be to see what the effect is of using finite differencing to construct the gradient for the Generalized Rosenbrock (GENROSEN) problem from CUTEst.jl using a conjugate gradient solver in Optim.jl.
using CUTEst, Optim, TimerOutputs nlp = CUTEstModel("GENROSE") const to = TimerOutput() f(x ) = @timeit to "f" obj(nlp, x) g!(g, x) = @timeit to "g!" grad!(nlp, x, g) begin reset_timer!(to) @timeit to "Conjugate Gradient" begin res = optimize(f, g!, nlp.meta.x0, ConjugateGradient(), Optim.Options(iterations=5*10^10)); println(Optim.minimum(res)) end @timeit to "Conjugate Gradient (FiniteDiff)" begin res = optimize(f, nlp.meta.x0, ConjugateGradient(), Optim.Options(iterations=5*10^10)); println(Optim.minimum(res)) end show(to; allocations = false) end
the output is a table as before, this time without the allocations (notice the use of the allocations keyword in the show
method)
──────────────────────────────────────────────────────────────── Time ────────────────────── Tot / % measured: 33.3s / 100% Section ncalls time %tot avg ──────────────────────────────────────────────────────────────── Conjugate Gradient (FiniteDiff) 1 33.2s 99.5% 33.2s f 1.67M 32.6s 97.9% 19.5μs Conjugate Gradient 1 166ms 0.50% 166ms g! 1.72k 90.3ms 0.27% 52.6μs f 2.80k 59.1ms 0.18% 21.1μs ────────────────────────────────────────────────────────────────
And we conclude: finite differencing is very slow when you’re solving a 500 dimensional unconstrained optimization problem, and you really want to use the analytical gradient if possible.
Benchmarking
Timing individual pieces of code can be very helpful, but when we’re timing small function calls, this way of measuring performance can be heavily influenced by noise. To remedy that, we use proper benchmarking tools. The package for that, well, it’s called BenchmarkTools.jl and is mainly written by Jarrett Revels. The package is quite advanced in its feature set, but its basic functionality is straight forward to use. Please see the manual for more details than we provide here.
Up until now, we’ve asked Julia to tell us how much time some code took to run. Unfortunately for us, the computer is doing lots of stuff besides the raw calculations we’re trying to time. From the example earlier, this means that we have a lot of noise in our measure of the time it takes to solve A\b
. Let us try to run test(1000)
a few times
julia> test(1000); 0.029859 seconds (10 allocations: 7.645 MB) julia> test(1000); 0.033381 seconds (10 allocations: 7.645 MB, 6.41% gc time) julia> test(1000); 0.024345 seconds (10 allocations: 7.645 MB) julia> test(1000); 0.039585 seconds (10 allocations: 7.645 MB) julia> test(1000); 0.037154 seconds (10 allocations: 7.645 MB, 2.82% gc time) julia> test(1000); 0.024574 seconds (10 allocations: 7.645 MB) julia> test(1000); 0.022185 seconds (10 allocations: 7.645 MB)
There’s a lot of variance here! Let’s benchmark instead. The @benchmark
macro won’t work inside a function as above. This means that we have to be a bit careful (thanks to Fengyang Wang for clarifying this). Consider the following
julia> n = 200; julia> A = rand(n,n); julia> b = rand(n); julia> @benchmark A\b BenchmarkTools.Trial: memory estimate: 316.23 KiB allocs estimate: 10 -------------- minimum time: 531.227 μs (0.00% GC) median time: 718.527 μs (0.00% GC) mean time: 874.044 μs (3.12% GC) maximum time: 95.515 ms (0.00% GC) -------------- samples: 5602 evals/sample: 1
This is fine, but since A and b are globals (remember, if it ain’t wrapped in a function, it’s a global when you’re working from the REPL), we’re also measuring the time dynamic dispatch takes. Dynamic dispatch happens here, because “Julia” cannot be sure what the types of A and b are when we invoke A\b
since they’re globals. Instead, we should use interpolation of the non-constant variables, or mark them as constants using const A = rand(n,n)
and const b = rand(20)
. Let us use interpolation.
julia> @benchmark $A\$b BenchmarkTools.Trial: memory estimate: 316.23 KiB allocs estimate: 10 -------------- minimum time: 531.746 μs (0.00% GC) median time: 717.269 μs (0.00% GC) mean time: 786.240 μs (3.22% GC) maximum time: 12.463 ms (0.00% GC) -------------- samples: 6230 evals/sample: 1
We see that the memory information is identical to the information we got from the other macros, but we now get a much more robust estimate of the time it takes to solve our A\b
problem. We also see that dynamic dispatch was negligible here, as the solution takes much longer to compute than for Julia to figure out which method to call. The @benchmark
macro will do various things automatically, to try to give as accurate results as possible. It is also possible to provide custom tuning parameters, say if you’re running these benchmarks over an extended period of time and want to track performance regressions, but that is beyond this blog post.
Dynamic dispatch
Before we conclude, let’s have a closer look at the significance of dynamic dispatch. When using globals it has to be determined at run time which method to call. If there a few methods, this may not be a problem, but the problem begins to show itself when a function has a lot of methods. For example, on Julia v0.5.0, identity
has one method, but +
has 291 methods. Can we measure the significance of dynamic dispatch then? Sure. Just benchmark with, and without interpolation (thanks again to Fengyang Wang for cooking up this example). To keep output from being too verbose, we’ll use the @btime
macro – again from BenchmarkTools.jl
.
julia> x = 0 0 julia> @btime identity(x) 1.540 ns (0 allocations: 0 bytes) 0 julia> @btime +x 15.837 ns (0 allocations: 0 bytes) 0 julia> @btime identity($x) 1.540 ns (0 allocations: 0 bytes) 0 julia> @btime +$x 1.548 ns (0 allocations: 0 bytes) 0
As we can see, calling +
on the global x
takes around 10 times a long as the single method function identity
. To show that declaring the input a const
and interpolating the variable gives the same result, consider the example below.
julia> const y = 0 0 julia> @btime identity(y) 1.539 ns (0 allocations: 0 bytes) 0 julia> @btime +y 1.540 ns (0 allocations: 0 bytes) 0 julia> @btime identity($y) 1.540 ns (0 allocations: 0 bytes) 0 julia> @btime +$y 1.540 ns (0 allocations: 0 bytes) 0
We see that interpolation is not needed, as long as we remember to use constants.
Conclusion
There are quite a few ways of measuring performance in Julia. I’ve presented some of them here, and hopefully you’ll be able to put the tools to good use. The functionality from Base is good for many purposes, but I really like the nested time measuring in TimerOutputs.jl a lot, and for serious benchmarking it is impossible to ignore BenchmarkTools.jl.