Re-posted from: https://bkamins.github.io/julialang/2022/12/09/lessons.html
Introduction
Toshiro Kageyama was an amateur Go player who became a successful
professional. His story had a profound influence on my thinking about to data
science and programming.
Let me start with a quote about his understanding of how go players differ
(taken from the Lessons In The Fundamentals of Go book):
Once it was thought (…) that amateurs could not even approach the
professional level. Nowadays, however, the great surge in the size of the
Go-playing public has narrowed the distance has narrowed the distance between
the tracks.
Does this sound familiar? In the past people that created complex, large scale
machine learning solutions, were mostly PhDs in computer science or similar.
Now the situation has changed. Thousands of developers, with various
backgrounds, successfully create amazing projects.
Toshiro Kageyama notes that every Go-player wants to get stronger. Likewise,
every data scientist, amateur or professional, wants to be able to create
better models, faster, and more reliably.
So what is the secret for becoming stronger? The major message of Toshiro
Kageyama is that you need to learn and focus on the fundamentals. In the book
he gives numerous examples from Go, baseball, sumo wrestling, or even cooking,
highlighting how important it is to follow basic principles of your trade no
matter if you are amateur or professional. He stresses that in order to
progress you cannot take shortcuts hoping that you can bypass fundamentals. In
short term this might give some effects, but will not work in the long run.
What does all this have to do with Julia you probably wonder by now.
I have been writing the Julia For Data Analysis book for a year and
it has been published this week. In this post I want to discuss what important
fundamental principle I re-learned many times during this process.
Always crate project environment for your code
One of the basic things that any data scientist learns is that any project
should be accompanied by a complete specification of environment in which
it should be run.
In Julia this is achieved via sharing Project.toml and Manifest.toml files as
is explained in the Pkg.jl manual.
So far all seems easy. If you want reproducibility of your work create the
Project.toml and Manifest.toml files for your code and Julia will
automatically instantiate the required environment.
This is indeed a golden rule that works like magic. The fact that Julia has
such excellent package manager built into the language is one of the things
that make me use it.
However, when I was writing a book I faced several challenges that are similar
to problems that large projects developed over a long period of time face.
Changes in versions of Julia
So what was my major problem? I wrote the book under Julia 1.7.
After I finished it Julia 1.8 was released.
Why is it an issue? Well – version of Julia is part of the specification
of the project environment. Therefore, technically, I should expect my readers
to install Julia 1.7 when running examples from my book.
However, this is what I wanted to avoid. It is natural to expect that users
will want to use latest version of Julia when learning it. Therefore I had
to make sure that my codes work both under Julia 1.7 and Julia 1.8.
And here I hit the first problem – not all that works under Julia 1.7 works
under Julia 1.8. You might be surprised by this, as Julia 1.8 should not
introduce any breaking changes. However, the major problem were binary
dependencies, that is external code dependencies that are not written in Julia.
Such dependencies sometimes have versions that are Julia-version specific. This
is an observation that I recommend you keep in mind when working on some
project that is expected to be worked on for a longer period of time.
Binary dependencies, round two
Julia has an excellent integration with Python. You can easily and conveniently
execute Python code as a part of your Julia programs. I wanted to show this
functionality in the book. As an example I used t-SNE from scikit-learn
(as a side note: there are very nice implementations of this algorithm in pure
Julia in TSne.jl or Manifolds.jl).
Guess what happened next? I got many requests from the readers indicating they
had problems with proper setting up of Python on their machines. Although there
are packages in Julia that help to automatically perform Python setup
(PyCall.jl and Conda.jl), still the diverse range of end-user environment
configurations meant that ensuring that proper Python executable along with all
proper Python packages is available is not always trivial.
Handling of bug fixes
Another problem I had was handling of bug fixes in packages. The (real)
scenario is as follows: I have finished writing a book and released all the
codes and after some time a serious bug in some package is revealed (by serious
I mostly mean things like security threats or severe memory leaks; guess what –
they mostly happen when some package has binary dependencies).
Now the problem is as follows: I have some package in version e.g. 4.3.1
defined in my Project.toml, but the bug is found and fixed only after the
package has version e.g. 5.1.2.
You most likely already see where the problem is. Typically the bug fix is not
backported to 4.x branch of the package. I need to update the version of the
package to 5.x branch (and then a cascade of other packages gets updated). In
the end – I need to re-run all my source code for the book to check if it still
works both under Julia 1.7 and 1.8.
The reality is that it did not work. And here let me comment on two reasons
why:
- I had to make a small adjustment because version 5.x of the package made some
breaking change in the API. It is a problem but not super big – since I got
an error indicating that API changed. - The second problem is more subtle. In the packages that got recursively
updated some of them changed minor or patch versions. You would assume that
nothing should break then. And here we hit a problem that is really serious
(and is my call to package developers, if they still read this text):
Do not use internal APIs of packages on which your package depends!
(another fundamental rule to re-learn)
In my case this is what happened. Some package had a version change and
modified its internal API. This is a non-breaking change, so all should be
good. Unfortunately, it was not, as some other package relied on this
internal API and stopped working.
Why I still think Julia package manager is great
Someone might say that all these problems are easily solved by contenerization.
Indeed, I could have done this, but I did not want to. There are two reasons:
- First, I want readers of my book to be exposed to real experience of using
Julia – in all its aspects. And in practice you expect that you will use
Julia installed on your machine and need to install packages. - Second, Julia package manager is excellent. Although I had problems I listed
above, the fact is that they are minor and solvable.
Let me comment on several features of Julia package manager that are especially
useful:
- When you add or update package you can choose resolution tier that
precisely describes what changes of versions of dependencies of the package
are allowed (ranging from: preserve all to update all to the latest and
greatest). This is important as when I only want to make a bug-fix update
I want to allow as few changes to the other packages as possible. - Package management is lightweight. Although each of your projects can have
a bit different set of packages and their versions it depends on, Julia
package manager keeps a federated repository of the packages. This means that
a specific version of the package needs to be installed only once per machine;
this saves a lot of time and disk space. - It has an excellent support for artifacts, that can be any
binary dependencies of the package; in particular it allows for shipping
pre-built platform-specific dependencies. So although the binaries can be
Julia-version dependent you do not need any extra external tools to build
such packages. And the beauty of this solution is that it works
on Linux, Mac, and Windows (think if have you ever had a situation that some
package does not want to build on Linux when you e.g. used R).
Final thoughts
Today we talk about fundamentals. I also tried to write my
Julia for Data Analysis book in a way that covers fundamentals of
Julia for data analysis.
For this reason I decided to skip discussing advanced machine learning
algorithms or integration options with various data sources. Instead, I cover
in the book essential topics that a data scientist will relatively soon need to
learn about the language (while skipping topics that are most likely less
important). The fundamentals are divided into two parts in the book:
- Part 1: the Julia language (e.g. various data container types, multiple
dispatch, various ways to work with strings). - Part 2: the data analysis ecosystem (e.g. working with data frames, plotting,
representing relationships in data using graphs).
With this book I hope to convince the readers that Julia language was designed
so that you can easily pick it up but go very far to become a top data science
developer. So, in principle, it sticks to what Toshiro Kageyama said: it allows
a wide community of analysts to be able to do what in the past was only
possible for experts.
Let me give a single example of this kind. When starting to work with Julia
users are often confused by ->
(syntax defining an anonymous function), =>
(a notation allowing you to create key-value pairs), ==(1)
(a way to create a
function that compares its argument to 1
using ==
, a.k.a partial function
application), or .
(symbol used when broadcasting a function).
Often, when reading some advanced code such new users see expressions like
(tested under Julia 1.8.2, DataFrames.jl 1.4.4, and DataFramesMeta.jl 0.12.0):
julia> using DataFrames
julia> df = DataFrame(a = [1, 3, 2, 1, 4])
5×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 3
3 │ 2
4 │ 1
5 │ 4
julia> transform(df, :a => ByRow(==(1)) => :a_eq_1)
5×2 DataFrame
Row │ a a_eq_1
│ Int64 Bool
─────┼───────────────
1 │ 1 true
2 │ 3 false
3 │ 2 false
4 │ 1 true
5 │ 4 false
They might feel intimidated by the complexity of the syntax.
While they could just write:
julia> using DataFramesMeta
julia> @rtransform(df, :a_eq_1 = :a == 1)
5×2 DataFrame
Row │ a a_eq_1
│ Int64 Bool
─────┼───────────────
1 │ 1 true
2 │ 3 false
3 │ 2 false
4 │ 1 true
5 │ 4 false
The first style (using transform
) is an underlying operation specification
syntax of DataFrames.jl: designed to be flexible and allowing for expression of
very complex operations. The second style (using @rtransform
) is a domain
specific language that is designed to be easy to pick up (dplyr users can
mostly just start using it without learning much).
In the book I cover both. Most likely you want to start with an easy syntax
that allows you to just get you the result you desire. However, I also explain
all the fundamentals of both Julia and its data analysis ecosystem, so when you
need to perform more complex tasks you will know all the required features and
you are ready to use them.
For example in the last chapter of the book, I show how to create and use a
web service that performs valuation of Asian options using multi-threading
to ensure its fast response time. As you can imagine, it is a relatively
complex (but fortunately short) code, and you need to have a good grasp of
fundamentals to easily understand it. If you would like to learn why similar
solutions are written using Julia in the industry I recommend you check out
Timeline case study.