Author Archives: Blog by Bogumił Kamiński

DataFrame vs NamedTuple: a comparison

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/12/15/table.html

Introduction

In Julia we have a common interface for working with tabular data. It is provided by the Tables.jl package.

The fact that such an interface is defined greatly simplifies interoperability between packages.
However, it also introduces a challenge: the user needs to decide which concrete type of table to use.

In my (biased) experience, the two most common types of tables used in practice are:

  • a NamedTuple;
  • a DataFrame.

In this post I want to compare them to give you some guidance on which one to choose, and when, in your projects.

The post was tested under Julia 1.9.2 and DataFrames.jl 1.6.1.

Dependency level

NamedTuple is a type provided by Base Julia. DataFrame is defined in DataFrames.jl.

This means that you always have access to NamedTuple, while for DataFrame you need
to install DataFrames.jl and later load it.

DataFrames.jl is a relatively big package. Its installation and precompilation take over 1 minute.
This may not be a huge amount of time, but if for some reason your project environment requires
frequent recompilation, it can start to feel cumbersome.

The other aspect is package load time:

julia> @time using DataFrames
  1.092920 seconds (1.27 M allocations: 78.653 MiB, 6.52% gc time, 0.46% compilation time)

Again, one second is not much, but in some applications users might want to avoid it.

In summary, NamedTuple wins here as a more lightweight option.

Conformance with Tables.jl interface

DataFrame is always a Tables.jl table.
NamedTuple is considered to be a table only if all of its fields are AbstractVectors.
This limitation requires extra effort: the user needs to ensure and check this property.

Let me give you one example where it is relevant:

julia> DataFrame(a=0, b=[1, 2]) # a table
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     0      1
   2 │     0      2

julia> (a=0, b=[1, 2]) # not a table
(a = 0, b = [1, 2])

julia> Tables.istable((a=0, b=[1,2]))
false

Additionally, NamedTuple does not automatically check whether the lengths of all columns match:

julia> Tables.istable((a=[0], b=[1,2]))
true

julia> DataFrame((a=[0], b=[1,2]))
ERROR: DimensionMismatch: column :a has length 1 and column :b has length 2

In summary, DataFrame wins here as it is safer. With NamedTuple you need to do additional manual checks to verify that the data you are working with is indeed a table.
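
The manual checks mentioned above could look roughly like the sketch below. It uses only Base Julia; the helper name is_consistent_table is hypothetical and is not part of Tables.jl or DataFrames.jl.

```julia
# Hypothetical helper (not part of Tables.jl): checks that every field of a
# NamedTuple is an AbstractVector and that all fields have the same length.
function is_consistent_table(nt::NamedTuple)
    cols = values(nt)
    all(col -> col isa AbstractVector, cols) || return false
    isempty(cols) && return true
    return all(col -> length(col) == length(first(cols)), cols)
end

is_consistent_table((a=0, b=[1, 2]))       # false: `a` is not a vector
is_consistent_table((a=[0], b=[1, 2]))     # false: column lengths differ
is_consistent_table((a=[0, 0], b=[1, 2]))  # true
```

Note that the second case is exactly the one that Tables.istable accepts but the DataFrame constructor rejects.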

Data safety

By default, the DataFrame constructor copies the data passed to it (this can be overridden by passing copycols=false to the constructor):

julia> x = [1, 2]
2-element Vector{Int64}:
 1
 2

julia> y = [3, 4]
2-element Vector{Int64}:
 3
 4

julia> df = DataFrame(; x, y)
2×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      4

julia> df.x === x
false

julia> df.y === y
false

This is not the case for a NamedTuple:

julia> nt = (; x, y)
(x = [1, 2], y = [3, 4])

julia> nt.x === x
true

julia> nt.y === y
true

This design means that DataFrame is safer. Once you create it, you can mostly forget about the potential risks of modifying the source data that was used to create it.
You might think that this is an issue of minor relevance. However, in practice, not making a copy has led to many hard-to-find bugs (and that is why we copy data by default when creating a DataFrame).

In summary, DataFrame wins here as it is safer (especially since you can disable the safe behavior by passing copycols=false to the DataFrame constructor if you wish).
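
To make the risk concrete, here is a minimal sketch in Base Julia of how a change to a source vector leaks into a NamedTuple, and of one way to get independent columns. The name nt_safe is only illustrative.

```julia
x = [1, 2]
y = [3, 4]

nt = (; x, y)            # the columns are the very same arrays as x and y
nt_safe = map(copy, nt)  # a new NamedTuple holding independent copies

x[1] = 100               # modify the source vector after creating the tables

nt.x[1]       # 100: the change leaks into nt
nt_safe.x[1]  # 1: the copied table is unaffected
```

With a NamedTuple the copy has to be requested explicitly, while the DataFrame constructor does it for you unless you opt out.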

Flexibility

A big practical difference between NamedTuple and DataFrame is that NamedTuple is immutable. You cannot add, remove, or rename its columns.
On the other hand DataFrame allows for such operations, which makes it more convenient when you need to manipulate your data.

In the flexibility dimension DataFrame is a clear winner.
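
With a NamedTuple the closest you can get to these operations is to build a new object each time. The following is a small sketch using only Base Julia; the variable names are illustrative.

```julia
nt = (x=[1, 2], y=[3, 4])

# "Add" a column: merge returns a new NamedTuple, nt itself is unchanged.
with_z = merge(nt, (z=[5, 6],))

# "Remove" a column: keep only the listed fields in a new NamedTuple.
only_x = NamedTuple{(:x,)}(nt)

# "Rename" columns: reuse the same vectors under new names.
renamed = NamedTuple{(:a, :b)}(values(nt))

keys(nt)       # (:x, :y), the original is untouched
keys(with_z)   # (:x, :y, :z)
keys(only_x)   # (:x,)
keys(renamed)  # (:a, :b)
```

Each of these calls produces an object of a different type, which is part of why a mutable DataFrame is more comfortable for step-by-step data manipulation.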

Performance

There is a flip side to the flexibility coin. DataFrame is not type stable, while NamedTuple is.
This means that, if you want performance, you need to either use separate kernel functions or higher-order functions provided by DataFrames.jl (like combine or select).

Here is an example of performance-unfriendly and performance-friendly code for DataFrame:

julia> function sum1(table)
           s = 0
           for v in table.x
               s += v
           end
           return s
       end
sum1 (generic function with 1 method)

julia> function sum2(table)
           function kernel(x)
               s = 0
               for v in x
                   s += v
               end
               return s
           end
           return kernel(table.x)
       end
sum2 (generic function with 1 method)

julia> df = DataFrame(x=1:10^8);

julia> @time sum1(df) # after compilation
  5.810585 seconds (400.00 M allocations: 7.451 GiB, 1.78% gc time)
5000000050000000

julia> @time sum2(df) # after compilation
  0.041333 seconds (1 allocation: 16 bytes)
5000000050000000

Note that there is no such issue with NamedTuple:

julia> nt = NamedTuple(pairs(eachcol(df)))
(x = [1, 2,  …  99999999, 100000000],)

julia> @time sum1(nt) # after compilation
  0.042012 seconds (1 allocation: 16 bytes)
5000000050000000

julia> @time sum2(nt) # after compilation
  0.051452 seconds (1 allocation: 16 bytes)
5000000050000000

The winner is NamedTuple. It is easier to have good performance using it.
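
The examples above use the kernel function approach. The other option mentioned earlier, a higher-order function from DataFrames.jl, could be sketched as follows (a smaller table is used here, and the output column name total is just an illustrative choice).

```julia
using DataFrames

df = DataFrame(x=1:10^6)

# combine hands the column :x to sum as a concretely typed vector,
# so the summation itself runs in type-stable code.
total = combine(df, :x => sum => :total)

total.total[1]  # 500000500000
```

The result of combine is again a DataFrame, here with one row and one column.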

Compilation

The downside of the type stability of NamedTuple is that the compiler has to specialize code on all of its columns, so it can take a lot of time to compile a function taking
it (or even to create it) if the number of columns is large. DataFrame does not have such issues.

Here is an example:

julia> @time df = DataFrame(transpose(1:10_000), :auto)
  0.008328 seconds (39.53 k allocations: 2.431 MiB)
1×10000 DataFrame
 Row │ x1     x2     x3     x4     x5     x6     x ⋯
     │ Int64  Int64  Int64  Int64  Int64  Int64  I ⋯
─────┼──────────────────────────────────────────────
   1 │     1      2      3      4      5      6    ⋯
                                 9994 columns omitted

julia> @time nt = NamedTuple(pairs(eachcol(df)));
  4.660914 seconds (604.77 k allocations: 27.920 MiB, 0.20% gc time, 99.15% compilation time)

As you can see, creating nt involved a significant compilation overhead.

For wide tables DataFrame is clearly the preferred option.
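
The reason for this overhead is that the column names and column types are part of the type of a NamedTuple. A short sketch in Base Julia illustrates this (the tables nt1 and nt2 are made up for the example):

```julia
nt1 = (a=[1, 2], b=[1.0, 2.0])
nt2 = (a=[1, 2], c=[1.0, 2.0])

# Column names and column element types are part of the NamedTuple type.
typeof(nt1) == NamedTuple{(:a, :b), Tuple{Vector{Int}, Vector{Float64}}}  # true

# Changing a single column name already gives a different type, so a function
# called with nt1 and nt2 is compiled separately for each of them.
typeof(nt1) == typeof(nt2)  # false
```

With 10,000 columns the type carries 10,000 names and 10,000 column types, which the compiler has to process. A DataFrame has the same type no matter what columns it stores.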

Display

DataFrame uses a nicely formatted PrettyTables.jl display. NamedTuple is not that readable.
Try displaying the nt object I created in the previous section. You will get several pages of
hard-to-read output.

DataFrame has a clearly superior default display mechanism.

Convenience

NamedTuple is a generic data type, while DataFrame was designed for working with tabular data specifically.
Therefore DataFrame provides numerous convenience functionalities that NamedTuple lacks. Let me give two
examples:

  • You can select a column of a DataFrame using either a string or a Symbol as its name; with NamedTuple you have to use a Symbol.
    Allowing strings has two big advantages. First, it is slightly easier to generate column names as strings programmatically.
    Second, it is easier to type column names containing special characters such as whitespace; for instance, "some column name"
    is inconvenient to work with using NamedTuple.
  • You have convenient column selectors, like Cols or regular expressions, which work with DataFrame and are not supported by
    NamedTuple.

If you need convenience, DataFrame should be your preference.
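
Both points can be seen in a short sketch. The column names and values below are made up for illustration.

```julia
using DataFrames

df = DataFrame("some column name" => [1, 2], "x1" => [3, 4], "x2" => [5, 6])

df[:, "some column name"]               # select a column by string
df[:, r"^x"]                            # regular expression: columns x1 and x2
df[:, Cols("some column name", r"2$")]  # combine several selectors with Cols

# The NamedTuple equivalent needs an explicit Symbol for such a name.
nt = NamedTuple{(Symbol("some column name"), :x1, :x2)}(([1, 2], [3, 4], [5, 6]))
nt[Symbol("some column name")]          # [1, 2]
```

The NamedTuple version works, but both constructing it and indexing into it require wrapping the name in Symbol(...).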

Functionality

Last but not least, DataFrame comes with dozens of convenience functions provided by the DataFrames.jl
package. These include split-apply-combine, joining, sorting, subsetting, broadcasting, etc. of
DataFrame objects. None of this is available for NamedTuple out of the box.
There are indeed extra packages that work nicely with NamedTuple, but this means that you need
to install and load them separately (and usually you will need several to get your job done).

Additionally there is a host of convenience packages (like DataFramesMeta.jl or Tidier.jl) that make it easier
to work with DataFrame objects.

When it comes to functionality, DataFrame is the winner.
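
As a brief illustration of a few of the operations listed above, here is a sketch on two small made-up tables (the table and column names are hypothetical):

```julia
using DataFrames

sales = DataFrame(id=[1, 2, 3, 4], region=["N", "S", "N", "S"], amount=[10, 20, 30, 40])
customers = DataFrame(id=[1, 2, 3], customer=["Ann", "Bob", "Cid"])

# Split-apply-combine: total amount per region.
per_region = combine(groupby(sales, :region), :amount => sum => :total)

# Sorting and subsetting.
sorted = sort(sales, :amount, rev=true)
large = subset(sales, :amount => ByRow(>(15)))

# Joining on a shared key column (id 4 has no match and is dropped).
joined = innerjoin(sales, customers, on=:id)
```

Each of these is a single call on a DataFrame, whereas for a NamedTuple you would need to write the logic yourself or bring in additional packages.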

Metadata

DataFrame supports storing table-level and column-level metadata (attributes in R or labels in Stata are a similar
concept). NamedTuple does not provide such functionality.

Therefore, if you want to annotate your data DataFrame is preferable.
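
A minimal sketch of both kinds of metadata is shown below. The keys and values are made up for illustration, and the functions used require DataFrames.jl 1.4 or later.

```julia
using DataFrames

df = DataFrame(income=[1000, 2000])

# Table-level and column-level metadata are stored as key-value pairs.
# style=:note marks an entry that DataFrames.jl keeps when the table is transformed.
metadata!(df, "source", "illustrative survey data", style=:note)
colmetadata!(df, :income, "label", "Monthly income in USD", style=:note)

metadata(df, "source")             # "illustrative survey data"
colmetadata(df, :income, "label")  # "Monthly income in USD"
```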

Concluding remarks

Let us summarize our findings:

  • NamedTuple wins in: dependency level, performance
  • DataFrame wins in: conformance with Tables.jl, data safety, flexibility, compilation, display, convenience, functionality, metadata

Given these considerations, I would say that most of the time DataFrame is a safe default choice of storage format for tabular data.
This is especially true for interactive workflows.

However, there are cases when you will find NamedTuple preferable. In the Julia world, performance usually gets a high priority.
NamedTuple is preferable here, especially if you have millions of small tables, as in this case the overhead of the larger DataFrame
object will be noticeable.

Setting up your Julia session

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/12/08/plot.html

Introduction

A nice feature of Julia is that it supports a startup file containing code
that is executed every time you start a new Julia session.

This functionality is described in the Julia Manual.
In this post I want to discuss this functionality from the user’s perspective.

The basics of startup file

The startup file is typically located at ~/.julia/config/startup.jl
(I am using Linux defaults here). However, this default location can be modified.
You can change it by manipulating environment variables. These more advanced
configuration options are described in the Julia Manual.

The startup.jl file contains the commands that are executed when Julia is run.

What are typical entries of the startup file?

Usually people load utility packages that they routinely need when working with
Julia (to avoid having to manually load them each time). Some popular examples are:

  • Revise.jl: allows you to modify source code without having to restart Julia;
  • OhMyREPL.jl: advanced highlighting support in Julia terminal;
  • JET.jl: code analyzer for Julia;
  • BenchmarkTools.jl: performance tracking of Julia code.

Another use is the programmatic setting of preferences, for example:

  • setting the default code editor via the ENV["EDITOR"] variable;
  • automatic activation of a project environment; I discussed it some time ago in an earlier post.
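
Putting some of these entries together, a startup.jl file could look like the sketch below. The editor name is only a placeholder, and wrapping the package load in try-catch is one possible way to keep startup working when the package is not installed.

```julia
# Example contents of ~/.julia/config/startup.jl (illustrative only).

# Set the default editor; "vim" is just a placeholder choice.
ENV["EDITOR"] = "vim"

# Load a utility package, but do not break startup if it is not installed.
try
    using Revise
catch err
    @warn "Revise could not be loaded" exception = err
end
```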

Selective execution of code in startup file

Julia can be started in two modes:

  • interactive (called REPL);
  • script execution.

Some of the features are useful only in REPL mode. For example, loading OhMyREPL.jl
probably does not add much value when executing a script.

Julia allows you to add code that should run only when it is started in REPL mode
by registering a function that is run only in this mode. This registration is done
via the atreplinit function, which should be called in your startup.jl file.
It looks as follows (the code inside the function is just an example):

atreplinit() do repl
    println("This is printed only in REPL, but not when executing a script")
end
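
A more practical variant is to load a REPL-only package from such a hook. The sketch below assumes you want OhMyREPL.jl; the function name load_repl_tools is hypothetical. Since a using statement is not allowed inside a function body, it is wrapped in @eval.

```julia
# Hypothetical hook: loads OhMyREPL.jl only when the REPL starts.
function load_repl_tools(repl)
    try
        # `using` must run at top level, so evaluate it there with @eval.
        @eval using OhMyREPL
    catch err
        @warn "OhMyREPL could not be loaded" exception = err
    end
    return nothing
end

# Register the hook; it is not called when Julia executes a script.
atreplinit(load_repl_tools)
```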

Disabling loading of a startup file

In some cases, especially if you run some third party Julia code, you might want
to disable loading the startup.jl file. This can be achieved by
passing the --startup-file=no command line argument.

Why would you want to do this? It ensures that the code you have in
your startup.jl does not conflict with the code you want to run (this situation
is rare, but possible).

Conclusions

Some Julia users prefer not to define startup.jl at all and always be explicit
about what is loaded when Julia is started. This scenario is probably most common
when someone mostly runs Julia scripts, as it ensures a clean environment and the fastest
load times.

However, many users, especially those who work interactively a lot, like to initialize
their Julia session each time it is started with some standard code.
For example, if you are a data scientist you might want to always load
CSV.jl, DataFrames.jl, Statistics.jl, and StatsBase.jl when you are working in REPL mode
to avoid having to load these packages manually every time you start a new session.

When working with startup.jl files there are two things worth remembering:

  1. You can make selected code run only when you work in REPL mode
    by using the atreplinit function.
  2. Even if you have a startup.jl file you can fully disable loading it by passing
    the --startup-file=no command line option.

Happy hacking!

Is Makie.jl up to speed?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/12/01/plot.html

Introduction

Makie is a plotting ecosystem for the Julia language that is extremely feature-packed and actively developed.
Recently its core package Makie.jl reached version 0.20. Its core developer Simon told me that
the package now loads much faster than was the case in the past.

The “time to first plot” issue is often raised by new users of the Julia ecosystem as important.
Therefore a lot of work has been done both by Julia core developers and by package maintainers to reduce it.

In this post I thought that it would be interesting to check how CairoMakie.jl compares to Plots.jl.
The Plots.jl package is another great plotting ecosystem for Julia. It is more lightweight, so in the past it was seen
as faster, but less feature rich. Let us see how the situation stands currently.
From the Makie ecosystem I have chosen CairoMakie.jl as I typically need 2-D production-quality plots.

The code in this post was tested under Julia Version 1.10.0-rc1, CairoMakie.jl 0.11.2 and Plots.jl 1.39.0.

Installing the package

This, and the following tests, are done in separate Julia sessions in separate project environments
for both plotting ecosystems.

The first timing we make is package installation. The results are the following:

  • CairoMakie.jl 0.11.2: 241 dependencies successfully precompiled in 227 seconds
  • Plots.jl 1.39.0: 62 dependencies successfully precompiled in 78 seconds

CairoMakie.jl takes about three times longer to install and has about four times as many dependencies.
Installation is a one-time cost; however, there are two considerations to keep in mind:

  • New users are first faced with installation time so it is noticeable (this is especially relevant with Pluto.jl).
  • Since CairoMakie.jl has many more dependencies it is more likely that it will require recompilation when any of them gets updated.

Loading the package

After installing packages we can check how long it takes to load them:

julia> @time using CairoMakie
  3.580779 seconds (2.75 M allocations: 181.590 MiB, 4.39% gc time, 1.71% compilation time: 49% of which was recompilation)

vs

julia> @time using Plots
  1.296026 seconds (1.05 M allocations: 73.541 MiB, 6.00% gc time, 2.19% compilation time)

CairoMakie.jl takes around three times longer to load. This difference is noticeable, but I think it is not a show-stopper in most cases.
Having to wait 3.5 seconds for a package to load should be acceptable unless someone expects to run a really short-lived Julia session.

Simple plotting

Now it is time to compare plotting times. Let us start with CairoMakie.jl:

julia> x = range(0, 10, length=100);

julia> y = sin.(x);

julia> @time lines(x, y)
  0.559800 seconds (476.77 k allocations: 32.349 MiB, 3.13% gc time, 94.66% compilation time)

julia> @time lines(x, y)
  0.012473 seconds (32.40 k allocations: 2.128 MiB)

vs Plots.jl:

julia> x = range(0, 10, length=100);

julia> y = sin.(x);

julia> @time plot(x, y)
  0.082866 seconds (9.16 k allocations: 648.188 KiB, 97.32% compilation time)

julia> @time plot(x, y)
  0.000508 seconds (484 allocations: 45.992 KiB)

The situation repeats. CairoMakie.jl is visibly slower, but having to wait 0.5 seconds for the first plot to be sent to the plotting engine is, I think, acceptable.
Note that subsequent plots are much faster as they do not require compilation.
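
The same first-call effect can be observed without any plotting package. In the sketch below first_call_demo is a hypothetical stand-in for a plotting function; the measured times will differ between machines, so no specific values are shown.

```julia
# A stand-in for a plotting call: any function that has not been called yet.
first_call_demo(x) = sum(sin, x)

x = range(0, 10, length=100)

t_first = @elapsed first_call_demo(x)   # includes compilation of the method
t_second = @elapsed first_call_demo(x)  # runs the already compiled code

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```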

Conclusions

Given the timings I have obtained, my judgment is as follows:

  • CairoMakie.jl is still visibly slower than Plots.jl.
  • Yet, CairoMakie.jl is in my opinion currently fast enough not to annoy users by requiring them to wait excessively long for a plot.

I think Makie maintainers, in combination with core Julia developers,
have done a fantastic job with improving time-to-first-plot in this ecosystem.

I can say that I have decided to switch to Makie as my default plotting tool for larger projects.
However, for now I will probably still use Plots.jl in scenarios where I just want to start Julia and make a single quick plot
(especially on a machine where it has yet to be installed).