Re-posted from: https://bkamins.github.io/julialang/2022/07/15/main.html
Introduction
Many users are attracted to Julia because it executes code fast. This is
possible because Julia programs are compiled to native machine code before
they are executed. At the same time, Julia is a convenient tool for
interactive data science sessions.
However, there is a fine line between doing data analysis in an interactive
session and writing a script that is meant to be executed many times after
its creation. The problem is that an interactive session often turns into
a script eventually, and there is a temptation to use the originally written
code verbatim in that script.
In this post I will explain why I personally never do it. Instead, when
I realize that I will need some code many times, I change my mindset
about how it should be written.
The points I raise in this post are commonly known, so most likely what I
discuss is not new to many readers. However, I decided to write it down because
keeping these points in mind is especially important when coding in Julia, as
opposed to, for example, R or Python, as I will explain in this post.
The post was tested under Julia 1.7.2 and Julia 1.9.0-DEV.575.
How do I define interactive and script coding style?
There are many aspects in which a well-written program differs from a log
of an interactive session. However, in my opinion, the most important one is
that in an interactive session one heavily uses global variables, while in
programs preferably no global variables are used. Instead, I typically define
a main function taking no arguments and implement all top-level logic in this
function.
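A minimal sketch of this layout is shown here. The helper name sum_to and the value of n are hypothetical and chosen only for illustration; the point is that the helper receives its input as an argument and main holds all top-level logic, so nothing is read from global scope.

```julia
# Hypothetical helper: its input arrives as an argument,
# so the loop only touches local variables.
function sum_to(n)
    s = 0
    for i in 1:n
        s += i
    end
    return s
end

# All top-level logic lives in a zero-argument main function.
function main()
    n = 1000                      # illustrative value
    total = sum_to(n)
    println("sum of 1:", n, " = ", total)
    return total
end

# The only statement executed at top level is the call to main.
main()
```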
The three most important reasons why this is a desirable practice are:
- performance;
- “forgotten variable” problem;
- Julia’s variable scoping rules.
Let me give examples of these three potential problems.
Are loops in Julia fast?
One of the selling points of Julia is that for loops are fast, as opposed
to R or Python. However, you might ask if this is really always the case.
Let us check (start a fresh Julia session):
julia> s = 0
0
julia> @time for i in 1:10^8
           s += i
       end
5.766783 seconds (200.00 M allocations: 2.980 GiB, 6.07% gc time)
julia> s
5000000050000000
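This is not fast. The reason is clear to any relatively experienced Julia
user. In the loop we refer to the variable s, which is global. Because an
untyped global variable can be rebound to a value of any type at any time,
the compiler cannot specialize the loop on the type of s. It is one of the
first things one learns when one starts using Julia.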
The problem is that, in my experience, it is very easy to forget to fix such
an issue when switching from interactive mode to script mode if the code base
is more than a few dozen lines long.
How would the code with a top-level main function look in this case?
(start a fresh Julia session)
julia> function main()
           s = 0
           for i in 1:10^8
               s += i
           end
           return s
       end
main (generic function with 1 method)
julia> @time main()
0.000001 seconds
5000000050000000
There are other options to make this code run fast in top-level scope; here are
some of them.
The first is to use a let block (start a fresh Julia session):
julia> s = 0
0
julia> @time s = let s = s
           s = 0
           for i in 1:10^8
               s += i
           end
           s
       end
0.000001 seconds (1 allocation: 16 bytes)
5000000050000000
The other is to add a type annotation to the global variable s (this requires
Julia 1.8 or later; I used Julia 1.9.0-DEV.575 in the test):
julia> s::Int = 0
0
julia> @time for i in 1:10^8
           s += i
       end
1.226593 seconds (100.00 M allocations: 1.490 GiB, 10.49% gc time)
julia> s
5000000050000000
As you can see, the type annotation makes the code run faster than with an
untyped global, but it is still slower than the other options.
However, even though these options are available, I find the main function
approach the cleanest.
Why does my program produce strange results?
Here is an example program that can produce surprising results (start a fresh
Julia session):
julia> l = -100
-100
julia> function f()
           x = 1
           for i in l:3
               x += i
           end
           return x
       end
f (generic function with 1 method)
julia> f()
-5043
The problem I show here is a standard trap: I have typed l (lowercase L)
instead of 1 in the for loop range, so the loop runs from -100 to 3 instead
of from 1 to 3.
The issue is that if you have more than a few hundred lines of code, it is
easy to forget which variable names you have already defined in global scope.
Then, if you mistype something in another part of the code, instead of an error
you silently get a wrong result. I have, unfortunately, fallen victim to this
trap so many times that I started to call it the “forgotten variable” problem.
The risk of such a problem is significantly lower if you wrap all your code in
functions (and preferably the body of each function should be short). In such
cases you will most likely remember all variables that are defined in each
function.
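As a rough illustration, the sketch below repeats the same typo in a program that defines no global variables at all. The function name sum_with_typo is hypothetical. Since no global l exists, the typo is expected to surface as an UndefVarError instead of a silently wrong result.

```julia
# Hypothetical function with the same typo as f above,
# but in a program where no global `l` has been defined.
function sum_with_typo()
    x = 1
    for i in l:3        # typo: lowercase L instead of 1
        x += i
    end
    return x
end

function main()
    try
        return sum_with_typo()
    catch err
        # Only handle the error we expect; anything else is rethrown.
        err isa UndefVarError || rethrow()
        return err.var  # name of the variable that was not defined
    end
end

println(main())
```

In a real script you would normally not catch the error at all; the try block is here only to make the failure visible as a value.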
Why does my program change its behavior when I run it as a script?
Save the code we used above for analyzing for loop performance in
the example1.jl file. The contents of the file should be:
s = 0
@time for i in 1:10^8
    s += i
end
println(s)
Now I try running this code in a script:
$ julia example1.jl
┌ Warning: Assignment to `s` in soft scope is ambiguous because a global
variable by the same name exists: `s` will be treated as a new local.
Disambiguate by using `local s` to suppress this warning or `global s`
to assign to the existing global variable.
└ @ example1.jl:3
ERROR: LoadError: UndefVarError: s not defined
As you can see, because of Julia’s scoping rules, the code that worked in
interactive mode stopped working in script mode. This is bad, but at least
your code threw an error.
Let us try the following code:
x = 0
for i in 1:10
    x = i
end
@show x
First run it in a fresh interactive session:
julia> x = 0
0
julia> for i in 1:10
           x = i
       end
julia> @show x
x = 10
10
Now save this code as the example2.jl file and run it as a script:
$ julia example2.jl
┌ Warning: Assignment to `x` in soft scope is ambiguous because a global
variable by the same name exists: `x` will be treated as a new local.
Disambiguate by using `local x` to suppress this warning or `global x`
to assign to the existing global variable.
└ @ example2.jl:3
x = 0
Now the code gave a warning (which is easy to ignore since it is complex, and
might even get lost in a flood of other output that the script produced),
but produced a result. The problem is that the result is different from the
one produced in an interactive session.
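One way to make the behavior identical in both modes is to move the code into a function, as sketched below. Inside a function x is an ordinary local variable, so the assignment in the loop updates it and the soft scope rule for globals does not come into play. The other route, suggested by the warning itself, is to write global x inside the loop.

```julia
# Sketch of example2.jl rewritten around a main function.
function main()
    x = 0
    for i in 1:10
        x = i       # assigns to the local x defined in main
    end
    return x
end

@show main()
```

This version should give the same value of x whether it is pasted into an interactive session or run with julia example2.jl.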
Conclusions
The problems I described in this post are well known, and most developers
keep them in mind when writing programs. However, in data science projects one
often starts one's work in an interactive session, and only later turns the
code into a script. In such cases it is tempting to keep the code written in
interactive mode unchanged. The problems I have described today show why you
should resist this temptation and refactor your code to use functions in such
a case.
It is important to keep in mind that while the “forgotten variable” problem is
present in many programming languages, the performance and variable scoping
traps are Julia-specific. The reason why the performance trap is not so relevant
for R or Python users is that these languages are not compiled to native
machine code, so for loops are slow even when defined inside functions.
Similarly, the scoping rule for how global variables are treated in local soft
scope is only an issue in Julia. (If you are unsure what local soft scope is,
please refer to the section on Local Scope in the Julia Manual.)
In summary, when working with Julia, you need to switch your mindset much
earlier from interactive mode to script mode than you would have to in R or
Python.