Author Archives: Blog by Bogumił Kamiński

MakieCon 2023 was amazing!

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/04/21/aog.html

Introduction

This week I participated in MakieCon 2023.
The conference was devoted to Makie, which is a data visualization
ecosystem for the Julia programming language, with high performance and extensibility.

The things you already can do with Makie are mind blowing,
including 2D and 3D plots, animations, dashboards and more.
If you want to see some showcases check out the Beautiful Makie website.

Today I want to write about AlgebraOfGraphics.jl package defines
a language for data visualization, that is especially useful with tabular data.

It is natural to compare AlgebraOfGraphics.jl to ggplot2 as indeed both
ecosystems allow to declaratively create graphics. In this post I want to highlight
one important difference. The ggplot2 is designed to be a stand alone ecosystem.
While AlgebraOfGraphics.jl is an addition to Makie.jl that makes standard tabular data
visualization easy. However, you still can apply all low-level operations that Makie.jl
provides to the created plots. Therefore you get the best of both worlds:

  • a convenient declarative system for defining plots;
  • ability of easy modification of generated plots to tune them to the exact needs of a data scientist.

I want to show you how this is achieved by example.

The post was written under Julia 1.9.0-rc1, AlgebraOfGraphics v0.6.14,
CSV v0.10.9, CairoMakie v0.10.4, and DataFrames v1.5.0.

Setting up the scene

We will want to plot the following data stored in CSV format:

aog_csv = """
problem,language,time,size
binary-trees,Java,1.5886075949367087,1.0321384425216316
binary-trees,Julia,4.6075949367088604,0.7836835599505563
binary-trees,Python,28.29113924050633,0.8158220024721878
fannkuch-redux,Java,1.3825857519788918,1.408791208791209
fannkuch-redux,Julia,1.0329815303430079,1.1725274725274726
fannkuch-redux,Python,45.04617414248021,1.043956043956044
fasta,Java,1.5384615384615383,1.7382091592617908
fasta,Julia,1.4487179487179485,0.7395762132604238
fasta,Python,47.30769230769231,1.330827067669173
k-nucleotide,Java,1.2196969696969697,1.203187250996016
k-nucleotide,Julia,1.2474747474747476,0.6314741035856574
k-nucleotide,Python,11.694444444444445,1.3061088977423638
mandelbrot,Java,3.1538461538461533,0.7013215859030837
mandelbrot,Julia,1.0923076923076922,0.5453744493392071
mandelbrot,Python,136.4230769230769,0.6061674008810573
n-body,Java,3.1784037558685445,0.9118187385180649
n-body,Julia,1.9765258215962442,0.6803429271279853
n-body,Python,254.15023474178406,0.7323943661971831
pidigits,Java,1.4107142857142856,0.7009174311926606
pidigits,Julia,1.732142857142857,0.46422018348623856
pidigits,Python,2.071428571428571,0.5201834862385321
regex-redux,Java,6.675,0.6649964209019327
regex-redux,Julia,2.175,0.5433070866141733
regex-redux,Python,1.675,1.0042949176807445
reverse-complement,Java,3.829268292682927,1.110941475826972
reverse-complement,Julia,3.5121951219512195,0.26564885496183205
reverse-complement,Python,16.146341463414636,0.41424936386768446
spectral norm,Java,3.780487804878049,0.631578947368421
spectral norm,Julia,2.707317073170732,0.3583959899749373
spectral norm,Python,275.5365853658537,0.34001670843776105
"""

The CSV stores information on speed and code size of programs
written in Java, Julia, and Python relative to C for ten selected
computational problems. The data was taken from
The Computer Language Benchmarks Game website.
I discuss the interpretation of this data in Chapter 1 of Julia for Data Analysis book.

Let us first load this data to a data frame:

using AlgebraOfGraphics
using CairoMakie
using CSV
using DataFrames
aog_df = CSV.read(IOBuffer(aog_csv), DataFrame)

You should get the following output:

30×4 DataFrame
 Row │ problem             language  time       size
     │ String31            String7   Float64    Float64
─────┼───────────────────────────────────────────────────
   1 │ binary-trees        Java        1.58861  1.03214
   2 │ binary-trees        Julia       4.60759  0.783684
   3 │ binary-trees        Python     28.2911   0.815822
   4 │ fannkuch-redux      Java        1.38259  1.40879
   5 │ fannkuch-redux      Julia       1.03298  1.17253
   6 │ fannkuch-redux      Python     45.0462   1.04396
   7 │ fasta               Java        1.53846  1.73821
   8 │ fasta               Julia       1.44872  0.739576
   9 │ fasta               Python     47.3077   1.33083
  10 │ k-nucleotide        Java        1.2197   1.20319
  11 │ k-nucleotide        Julia       1.24747  0.631474
  12 │ k-nucleotide        Python     11.6944   1.30611
  13 │ mandelbrot          Java        3.15385  0.701322
  14 │ mandelbrot          Julia       1.09231  0.545374
  15 │ mandelbrot          Python    136.423    0.606167
  16 │ n-body              Java        3.1784   0.911819
  17 │ n-body              Julia       1.97653  0.680343
  18 │ n-body              Python    254.15     0.732394
  19 │ pidigits            Java        1.41071  0.700917
  20 │ pidigits            Julia       1.73214  0.46422
  21 │ pidigits            Python      2.07143  0.520183
  22 │ regex-redux         Java        6.675    0.664996
  23 │ regex-redux         Julia       2.175    0.543307
  24 │ regex-redux         Python      1.675    1.00429
  25 │ reverse-complement  Java        3.82927  1.11094
  26 │ reverse-complement  Julia       3.5122   0.265649
  27 │ reverse-complement  Python     16.1463   0.414249
  28 │ spectral norm       Java        3.78049  0.631579
  29 │ spectral norm       Julia       2.70732  0.358396
  30 │ spectral norm       Python    275.537    0.340017

Drawing a plot

Start with a complete code producing the plot:

fig = Figure(resolution = (900, 600))
aog_plt = data(aog_df) *
          mapping(:problem, [:time, :size];
                  marker=:language,
                  color=:language,
                  col = dims(1) => renamer(["", ""])) *
          visual(Scatter; markersize=15, strokewidth=1)
grid = draw!(fig, aog_plt,
             axis=(; xticklabelrotation=pi/6),
             palettes=(; marker=[:rect, :circle, :diamond],
                       color=["lightgray", "gold", "lightblue"]))
grid[1].axis.yscale = log10
hlines!(grid[1].axis, 1.0; color="red")
hlines!(grid[2].axis, 1.0; color="red")
lg = AlgebraOfGraphics.compute_legend(grid)
push!(lg[1][1], [LineElement(color="red")])
push!(lg[2][1], "C")
Legend(fig[1, end+1], lg...)
fig

It produces the following figure:

Benchmark

On the figure we can see that Julia performs quite well in both
time and code size in comparison to other programming languages.

However, today I want to focus on plotting.

The first thing is displaying the plot.
If you use Jupyter Notebook or VS Code the plot will be just shown.
However, if you work in the terminal automatic displaying is turned
off by default. You need to run the Makie.inline!(false) command
to turn it on. After issuing it each time you try to display the
figure it will be opened in your default image viewer.

Now let me comment on the plotting code.
If you have never used AlgebraOfGraphics.jl I recommend you first
read the AlgebraOfGraphics.jl manual as the code
I used is relatively advanced.

The line creating the aog_plt object is using AlgebraOfGraphics.jl
commands. As you can see using * we can link data, with mapping operations
and visualization commands. What is interesting in the presented case
is the fact that my source data contains two columns :time and :size
that I want plotted on separate subplots. This is not a problem in
AlgebraOfGraphics.jl. I just needed to pass a vector [:time, :size]
to make it work. Later the col = dims(1) => renamer(["", ""]) instruction
signals that I want to have them plotted in separate subplots. The
renamer(["", ""]) disables display of subplot titles as I have
series names shown as y-axis labels anyway.

Now notice the draw! command. It inserts the plot specification
created in AlgebraOfGraphics.jl into a Makie figure. The nice thing is
that it is possible to customize the plot by passing axis and palettes
keyword arguments.

The code that was really interesting for me comes next. The grid variable
is bound to a standard Makie object. Therefore I can later tweak as I wish
using Makie primitives. In this case what I wanted to do was:

  • Changing the y-axis of the first of the plots to logarithmic scale.
    I could easily achieve this by writing grid[1].axis.yscale = log10.
  • Adding two custom horizontal lines (representing the reference C language)
    to both plots which I achieved using the hline! command.
  • Manually adding the C language entry to the legend of the plot which
    I achieved by manipulating the lg object.

What is important is that even for someone who is not a Makie expert it was possible to
work out how the underlying objects should be mutated to get what I wanted
without a significant effort.

Of course one might want AlgebraOfGraphics.jl to be a complete plotting
system like ggplot2. However, I think that the choice made in it is correct.
I appreciate that the API of AlgebraOfGraphics.jl is relatively small.
In this way you can easily learn to do simple and standard things.
It also means that is is convenient to do exploratory data analysis
(when you often need to do a lot of simple plots).
Later, if I am to prepare a publication quality plot in which I need to perform
some non-standard customizations I can use the Makie functionality
to achieve this.

Conclusions

Here are the key things I learned during MakieCon 2023:

  • Makie ecosystem provides a lot of advanced functionalities that
    are hard to find in other plotting packages. I typically need to prepare
    high-quality plots for publications. In this aspect a particular value
    of Makie is that it allows you to tweak every detail of the plot
    the way you want.
  • AlgebraOfGraphics.jl is not an equivalent of ggplot2
    (which is a complete stand-alone plotting system). I find it easier
    to think about AlgebraOfGraphics.jl as a set of extra functions that build on top
    of Makie and provide a convenient way to create visualizations of tabular data.
    However, you can later fall-back to Makie primitives to further customize the plots if needed.

Additionally, if you want to start working with Makie ecosystem you need to know that:

  • It is essential that you read the Makie tutorial first.
    To ensure the full flexibility of how your plots are styled Makie.jl introduces
    several concepts (like Figure or Axis objects) that you need to learn if you
    want to confidently work with it.
  • Makie ecosystem provides several plotting backends that differ in functionality.
    I use CairoMakie.jl as I typically need non-interactive 2D plots that have
    publication quality. As a particular thing in this case you need to know
    that if you are working in a terminal you need to use the Makie.inline!(false)
    command to turn-on automatic rendering of a plot (this is not needed in Jupyter Notebook
    and VS Code) if you want it (alternatively you can explicitly save your plot
    to a file).

Before I finish I would like to thank Dr. Lazaro Alonso Silva for inviting me to the event.
The conference was an exceptional experience. Makie community is amazing and full of brilliant members.

I also appreciate a lot the help from Pietro Vertechi and Fabian Greimel
that gave me many useful hints about how to get the most out of AlgebraOfGraphics.jl.

How is equality checked in DataFrames.jl?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/04/14/isequal.html

Introduction

Today I want to discuss how values are tested for being equal in
functions provided by DataFrames.jl.

I already discussed the topic of equality testing in the past in
this post and this post and
explain it extensively in chapter 7 of my Julia for Data Analysis book.
However, the issue is still often raised by users, so I thought it is useful
to go back to it one more time.

The post was written under Julia 1.9.0-rc1, CategoricalArrays.jl 0.10.7 and DataFrames.jl 1.5.0.

Why equality testing is hard?

When users learn Julia they are typically taught that == is the operator
that should be used for testing for equality. Here is a basic example:

julia> 1 == 2
false

julia> 1 == 1
true

However, there are the following aspects of == that make it not intuitive
in some scenarios.

First is that == does not guarantee to return Bool value. The problem
is that if one of its arguments is missing then the result will be missing:

julia> 1 == missing
missing

julia> missing == missing
missing

Clearly this is not desirable in cases when we expect the operation to return Bool (e.g. when filtering data):

julia> x = [1, 2, missing, 4, 5]
5-element Vector{Union{Missing, Int64}}:
 1
 2
  missing
 4
 5

julia> x[x .> 2.5]
ERROR: ArgumentError: unable to check bounds for indices of type Missing

In such cases you should use the coalesce function to decide if you want to keep or drop missing values:

julia> x[coalesce.(x .> 2.5, false)]
2-element Vector{Union{Missing, Int64}}:
 4
 5

julia> x[coalesce.(x .> 2.5, true)]
3-element Vector{Union{Missing, Int64}}:
  missing
 4
 5

The second issue is that == follows IEEE semantics for floating-point numbers:

julia> NaN == NaN
false

julia> 0.0 == -0.0
true

First, you see that NaN is not considered to be equal to NaN. This can be quite surprising:

julia> x = [1, NaN, 2]
3-element Vector{Float64}:
   1.0
 NaN
   2.0

julia> x[x .== NaN]
Float64[]

instead the isnan function can be used:

julia> x[isnan.(x)]
1-element Vector{Float64}:
 NaN

The 0.0 and -0.0 case is even more tricky. These are two technically distinct values,
however, in some applications user might want them to be treated as equal, while in other
as not equal. IEEE standard determines that they are considered to be equal when compared
using ==.

In summary, the major problem with == is that it does not define a proper equivalence
relation. First, some values are not comparable (missing is returned); second
it is not reflexive (for NaN).

An alternative way to compare values

In many cases you need an equality operator that defines an equivalence relation.
In Julia this is provided by the isequal function. As you can read in its documentation:

isequal treats all floating-point NaN values as equal to each other,
treats -0.0 as unequal to 0.0, and missing as equal to missing.
Always returns a Bool value.

Let us check this:

julia> isequal(1, missing)
false

julia> isequal(missing, missing)
true

julia> isequal(NaN, NaN)
true

julia> isequal(0.0, -0.0)
false

In Julia functions that create equivalence classes over sets of some values use
isequal to test for equality. In Base Julia such are for example Dict and Set
operations or the unique function:

julia> Set([0.0, 0.0, -0.0, -0.0, NaN, NaN, missing, missing])
Set{Union{Missing, Float64}} with 4 elements:
  0.0
  NaN
  -0.0
  missing

julia> unique([0.0, 0.0, -0.0, -0.0, NaN, NaN, missing, missing])
4-element Vector{Union{Missing, Float64}}:
   0.0
  -0.0
 NaN
    missing

The same rules carry over to DataFrames.jl.

Testing for equality in DataFrames.jl

There are the following functionalities of DataFrames.jl that rely on the isequal equality test:

  • deduplication with unique and related functions;
  • grouping with groupby;
  • joins (innerjoin etc.).

Let us see them in action one by one. We start with the deduplication:

julia> using DataFrames

julia> df = DataFrame(id=1:8, x=[0.0, 0.0, -0.0, -0.0, NaN, NaN, missing, missing])
8×2 DataFrame
 Row │ id     x
     │ Int64  Float64?
─────┼──────────────────
   1 │     1        0.0
   2 │     2        0.0
   3 │     3       -0.0
   4 │     4       -0.0
   5 │     5      NaN
   6 │     6      NaN
   7 │     7  missing
   8 │     8  missing

julia> unique(df, :x)
4×2 DataFrame
 Row │ id     x
     │ Int64  Float64?
─────┼──────────────────
   1 │     1        0.0
   2 │     3       -0.0
   3 │     5      NaN
   4 │     7  missing

Indeed, we see that 0.0 and -0.0 are considered as not equal,
while NaN and missing are deduplicated.

Now let us turn to grouping:

julia> show(groupby(df, :x), allgroups=true)
GroupedDataFrame with 4 groups based on key: x
Group 1 (2 rows): x = 0.0
 Row │ id     x
     │ Int64  Float64?
─────┼─────────────────
   1 │     1       0.0
   2 │     2       0.0
Group 2 (2 rows): x = -0.0
 Row │ id     x
     │ Int64  Float64?
─────┼─────────────────
   1 │     3      -0.0
   2 │     4      -0.0
Group 3 (2 rows): x = NaN
 Row │ id     x
     │ Int64  Float64?
─────┼─────────────────
   1 │     5       NaN
   2 │     6       NaN
Group 4 (2 rows): x = missing
 Row │ id     x
     │ Int64  Float64?
─────┼─────────────────
   1 │     7   missing
   2 │     8   missing

As you can see we get the same result again. As a side note let me
comment that unique internally uses the same mechanism as groupby
to identify duplicates.

Finally let us check joins:

julia> df_ref = DataFrame(x=[0.0, missing], val=1:2)
2×2 DataFrame
 Row │ x          val
     │ Float64?   Int64
─────┼──────────────────
   1 │       0.0      1
   2 │ missing        2

julia> outerjoin(df, df_ref, on=:x)
ERROR: ArgumentError: missing values in key columns are not allowed when matchmissing == :error

We get a first problem. Joins detect that missing value is present in key column.
By default it errors in such a case. We can change it using the matchmissing keyword argument.
Let us assume that we want missing values to be treated as equal and try the following join:

julia> outerjoin(df, df_ref, on=:x, matchmissing=:equal)
ERROR: ArgumentError: currently for numeric values NaN and `-0.0` in their real or imaginary components are not allowed. Use CategoricalArrays.jl to wrap these values in a CategoricalVector to perform the requested join.

We still get an error. In joins, for safety, if -0.0 is encountered in key then an error is thrown. This can be fixed by transforming the :x column to categorical,
in which case -0.0 and 0.0 are considered to be different:

julia> using CategoricalArrays

julia> outerjoin(transform(df, :x => categorical => :x), df_ref, on=:x, matchmissing=:equal)
8×3 DataFrame
 Row │ id      x          val
     │ Int64?  Float64?   Int64?
─────┼────────────────────────────
   1 │      1        0.0        1
   2 │      2        0.0        1
   3 │      7  missing          2
   4 │      8  missing          2
   5 │      3       -0.0  missing
   6 │      4       -0.0  missing
   7 │      5      NaN    missing
   8 │      6      NaN    missing

Let us check categorical vector in more detail:

julia> levels(categorical(df.x))
3-element Vector{Float64}:
  -0.0
   0.0
 NaN

As you can see 0.0 and -0.0 are considered to be separate levels in a categorical vector.

Conclusions

I hope the examples given today were useful for understanding how == and isequal work in Julia.

As a final comment let me add that throwing an error for joins on -0.0 was a decision that was made
for safety reasons. However, if users give us a feedback that adding other options of handling -0.0
would be useful (e.g. treating them as equal or not-equal) then we could consider adding this feature
in the future releases of DataFrames.jl.

Element type surprises when processing collections in Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/04/07/narrowing.html

Introduction

Today I want to write about a topic that is a quite tricky
element of design of Julia. The issue is that it is sometimes hard
to predict the element type of the output collection produced by
an operation that transforms an input collection.

The description above looks complicated, but the problem is
encountered in practice, so let me explain it by example.

The post was written under Julia 1.9.0-rc1.

A basic example

Assume you have some input collection [1, 2, 3] and you
want to compute square root of all its elements.

Let us consider three standard ways how you can do it:

julia> x = [1, 2, 3]
3-element Vector{Int64}:
 1
 2
 3

julia> sqrt.(x)
3-element Vector{Float64}:
 1.0
 1.4142135623730951
 1.7320508075688772

julia> map(sqrt, x)
3-element Vector{Float64}:
 1.0
 1.4142135623730951
 1.7320508075688772

julia> [sqrt(v) for v in x]
3-element Vector{Float64}:
 1.0
 1.4142135623730951
 1.7320508075688772

As you can see in each case a proper element type, that is
Float64, was determined for the returned collection.

This behavior is useful, as the user does not have to think
about specifying the output element type. In fact,
in combination with the transformation using the identity function,
this behavior can be used to conveniently narrow
down element type of some collection:

julia> y = Any[1, 2, 3]
3-element Vector{Any}:
 1
 2
 3

julia> identity.(y)
3-element Vector{Int64}:
 1
 2
 3

julia> map(identity, y)
3-element Vector{Int64}:
 1
 2
 3

julia> [v for v in y]
3-element Vector{Int64}:
 1
 2
 3

This pattern comes handy if we have input data that does not have a known
element type, but later we want to perform element type narrowing
when processing it (one of the major benefits of such narrowing
is that processing vectors of Any values is slow so we typically want to avoid it).

The hard case

Automatic output element type detection works nice most of the time.
Unfortunately, when we work with empty collections, it becomes hard to predict.
Here is a simple example:

julia> String.([])
Any[]

julia> map(String, [])
String[]

julia> [String(v) for v in []]
Any[]

julia> string.([])
AbstractString[]

julia> map(string, [])
Any[]

julia> [string(v) for v in []]
Any[]

As you can see from it broadcating, map, and comprehension use a different set of
rules to automatically determine the produced element type. These rules of course
exist and could be learned, but the point is that the issue is non-trivial.

The problem is that when you are writing production code
(e.g. you are developing a package) you want to be sure
what the element type of the collection you produce will be, as often
you cannot know upfront if the input collection the user is going to provide
will be empty or not.

The solution I use

In situations when it matters what the element type of the collection
produced by some transformation is going to be I use comprehensions
with output element type annotation:

julia> [string(v) for v in []]
Any[]

julia> String[string(v) for v in []]
String[]

Such annotation has an additional consequence that it is going to perform
conversion of the produced elements to the target type if needed:

julia> using Test

julia> s = ["a", GenericString("a")]
2-element Vector{AbstractString}:
 "a"
 "a"

julia> [string(v) for v in s]
2-element Vector{AbstractString}:
 "a"
 "a"

julia> typeof.([string(v) for v in s])
2-element Vector{DataType}:
 String
 GenericString

julia> String[string(v) for v in s]
2-element Vector{String}:
 "a"
 "a"

julia> typeof.(String[string(v) for v in s])
2-element Vector{DataType}:
 String
 String

Note that in the example prefixing the comprehension with
String made sure that the result of the operation has
String element type and all produced values have this type.

Element type widening

Let me comment on one common related operation. What if we
want to initialize some container with a given value but
we want its element type to be wider? This is not an artificial
case – it often happens with missing (where we initialize
some container with this value only to later replace missing with proper
values).

Using the fill function is a first thing we might try:

julia> fill(missing, 3)
3-element Vector{Missing}:
 missing
 missing
 missing

However, the produced container has Missing element type which
is not useful if we e.g. wanted to later also store integers in it.

One can use a comprehension annotated with a proper output element type
instead:

julia> Union{Int, Missing}[missing for _ in 1:3]
3-element Vector{Union{Missing, Int64}}:
 missing
 missing
 missing

The pattern with missing is needed often enough that we have
a custom function in the Missings.jl package that can be used
to get the desired result more conveniently:

julia> using Missings

julia> missings(Int, 3)
3-element Vector{Union{Missing, Int64}}:
 missing
 missing
 missing

Conclusions

Fortunately, in interactive use the problem with setting
of the proper element type for some collection does not
occur often. However, when I write production programs I make
sure to always think if I need to use comprehension with
element type specification to ensure type stability of my code.