Author Archives: Blog by Bogumił Kamiński

Getting ready for JuliaCon Local Eindhoven 2023

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/10/13/juliacon.html

Introduction

I am really excited that soon we will have a Julia conference in Europe.
The JuliaCon Local Eindhoven 2023 will take place on December 1, 2023 in Eindhoven.

I thought that as nice preparation for this event it would be useful to share
some tips that can be useful for DataFrames.jl users.
Frequently, when working with tables, one has to process time series data.
I want to share a few examples how such data can be stored in DataFrames.jl.

The post was written under Julia 1.9.2 and DataFrames.jl 1.6.1.

Setting up the stage

Let us start with a data frame that has 4 columns and 1000 rows representing consecutive measurements of some data.
I also add the :period column keeping track of the moment at which the measurement was made.

julia> using DataFrames

julia> using Random

julia> Random.seed!(1234);

julia> df1 = DataFrame(rand(1000, 4), :auto);

julia> df1.period = 1:1000;

julia> df1
1000×5 DataFrame
  Row │ x1         x2         x3        x4         period
      │ Float64    Float64    Float64   Float64    Int64
──────┼───────────────────────────────────────────────────
    1 │ 0.579862   0.473374   0.366288  0.771849        1
    2 │ 0.411294   0.176997   0.652475  0.938484        2
    3 │ 0.972136   0.676856   0.900155  0.390673        3
    4 │ 0.0149088  0.177963   0.374975  0.694063        4
  ⋮   │     ⋮          ⋮         ⋮          ⋮        ⋮
  997 │ 0.424295   0.78012    0.577681  0.274228      997
  998 │ 0.481554   0.0441882  0.700181  0.317677      998
  999 │ 0.401528   0.658903   0.438309  0.211217      999
 1000 │ 0.58985    0.909857   0.354735  0.442829     1000
                                          992 rows omitted

Reshaping data: wide to long

The first operation we want to do is reshaping the data from wide to long format,
in which we will have three columns: the period, the variable name, and the variable value.
To achieve this we can use the stack function:

julia> df2 = stack(df1, Not(:period))
4000×3 DataFrame
  Row │ period  variable  value
      │ Int64   String    Float64
──────┼─────────────────────────────
    1 │      1  x1        0.579862
    2 │      2  x1        0.411294
    3 │      3  x1        0.972136
    4 │      4  x1        0.0149088
  ⋮   │   ⋮        ⋮          ⋮
 3997 │    997  x4        0.274228
 3998 │    998  x4        0.317677
 3999 │    999  x4        0.211217
 4000 │   1000  x4        0.442829
                   3992 rows omitted

Reshaping data: long to wide

Now let us perform the reverse operation, but this time put :period as columns.
We can do it using the unstack function:

julia> df3 = unstack(df2, :variable, :period, :value)
4×1001 DataFrame
 Row │ variable  1         2         3         4          5         6        ⋯
     │ String    Float64?  Float64?  Float64?  Float64?   Float64?  Float64? ⋯
─────┼────────────────────────────────────────────────────────────────────────
   1 │ x1        0.579862  0.411294  0.972136  0.0149088  0.520355  0.639562 ⋯
   2 │ x2        0.473374  0.176997  0.676856  0.177963   0.670122  0.042361
   3 │ x3        0.366288  0.652475  0.900155  0.374975   0.387857  0.779594
   4 │ x4        0.771849  0.938484  0.390673  0.694063   0.724383  0.130453
                                                           995 columns omitted

Nesting columns

Another way to reshape this data is to put all periods in a single cell of a data frame.
We can do it in two ways.

The first is to keep them in columns :x1, :x2, :x3, and :x4. Here are two ways
how you can do it starting from df1 or df2:

julia> combine(df1, Not(:period) .=> Ref, renamecols=false)
1×4 DataFrame
 Row │ x1                                 x2                                 ⋯
     │ Array…                             Array…                             ⋯
─────┼────────────────────────────────────────────────────────────────────────
   1 │ [0.579862, 0.411294, 0.972136, 0…  [0.473374, 0.176997, 0.676856, 0…  ⋯
                                                             2 columns omitted
julia> unstack(df2, [], :variable, :value, combine=collect)
1×4 DataFrame
 Row │ x1                                 x2                                 ⋯
     │ Array…?                            Array…?                            ⋯
─────┼────────────────────────────────────────────────────────────────────────
   1 │ [0.579862, 0.411294, 0.972136, 0…  [0.473374, 0.176997, 0.676856, 0…  ⋯
                                                             2 columns omitted

Alternatively, we could want to have two columns, the first with variable names
and the second with the vectors. Again, let me show two ways to do it starting
from df2 and df3 data frames:

julia> df4 = combine(groupby(df2, :variable), :value => Ref, renamecols=false)
4×2 DataFrame
 Row │ variable  value
     │ String    SubArray…
─────┼─────────────────────────────────────────────
   1 │ x1        [0.579862, 0.411294, 0.972136, 0…
   2 │ x2        [0.473374, 0.176997, 0.676856, 0…
   3 │ x3        [0.366288, 0.652475, 0.900155, 0…
   4 │ x4        [0.771849, 0.938484, 0.390673, 0…

julia> combine(df3, :variable, AsTable(Not(:variable)) => ByRow(identity∘collect) => :value)
4×2 DataFrame
 Row │ variable  value
     │ String    Array…
─────┼─────────────────────────────────────────────
   1 │ x1        Union{Missing, Float64}[0.579862…
   2 │ x2        Union{Missing, Float64}[0.473374…
   3 │ x3        Union{Missing, Float64}[0.366288…
   4 │ x4        Union{Missing, Float64}[0.771849…

These examples are slightly more complicated so let me explain them.

In the first one we use Ref to protect a vector against expansion into multiple columns
(note that no copying of data happens here, to perform a copy you should write Ref∘copy).

In the second one the AsTable(Not(:variable)) source column selector produces a NamedTuple.
However, Julia users probably are aware that with 1000 entries such a NamedTuple would take a long
time to compile. For this reason we are using an optimization that DataFrames.jl provides.
If we pass a function of a form some_function∘collect then this compilation step is avoided.
In our case the function is just identity as we are happy with producing a vector (which collect already returns).

Unnesting columns

As a last example let us show how to expand the df4 data frame back into multiple columns.
This is easy. Just write:

julia> select(df4, :variable, :value => AsTable)
4×1001 DataFrame
 Row │ variable  x1        x2        x3        x4         x5        x6       ⋯
     │ String    Float64   Float64   Float64   Float64    Float64   Float64  ⋯
─────┼────────────────────────────────────────────────────────────────────────
   1 │ x1        0.579862  0.411294  0.972136  0.0149088  0.520355  0.639562 ⋯
   2 │ x2        0.473374  0.176997  0.676856  0.177963   0.670122  0.042361
   3 │ x3        0.366288  0.652475  0.900155  0.374975   0.387857  0.779594
   4 │ x4        0.771849  0.938484  0.390673  0.694063   0.724383  0.130453
                                                           995 columns omitted

Since we used AsTable as target column names specifier we got auto-generated column names.

Conclusions

I hope you liked the examples I presented today and learned something from them.

I am sure that if you attend JuliaCon Local Eindhoven 2023 you are going to
see a lot of highly informative and inspirational talks there!

Column unioning in Tables.jl: row vs column oriented storage

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/10/06/tables.html

Introduction

Today I want to continue exploration of Tables.jl package functionality.
Some time ago I wrote a post about getting a schema of Tables.jl tables.
In the post I discussed, in particular, “column unioning”.

The column unioning approach allows you to put data with heterogenous entries into a single table
by filling the missing entries with missing values.

In my previous post I discussed the Tables.dictrowtable function that performs column unioning.
Today I want to introduce you to the Tables.dictcolumntable function that has a similar functionality,
but uses a different storage format.

The post was written using Julia 1.9.2, Tables.jl 1.11.0, and DataFrames.jl 1.6.1.

Introduction: an example of column unioning

Let us start with a simple example of column unioning behavior:

julia> using DataFrames

julia> vnt = [(a=1, b=2), (a=3, c=4), (b=5, d=6)]
3-element Vector{NamedTuple{names, Tuple{Int64, Int64}} where names}:
 (a = 1, b = 2)
 (a = 3, c = 4)
 (b = 5, d = 6)

julia> DataFrame(Tables.dictrowtable(vnt))
3×4 DataFrame
 Row │ a        b        c        d
     │ Int64?   Int64?   Int64?   Int64?
─────┼────────────────────────────────────
   1 │       1        2  missing  missing
   2 │       3  missing        4  missing
   3 │ missing        5  missing        6

julia> DataFrame(Tables.dictcolumntable(vnt))
3×4 DataFrame
 Row │ a        b        c        d
     │ Int64?   Int64?   Int64?   Int64?
─────┼────────────────────────────────────
   1 │       1        2  missing  missing
   2 │       3  missing        4  missing
   3 │ missing        5  missing        6

We have a heterogeneous table vnt (a vector of named tuples).
Each of its rows has a different set of columns.
After column unioning operation we get a table (displayed as a data frame in our example) where each row has four columns and the missing values are filled with missing.

Such situations are quite common when one e.g. processes JSON data and each object may have a varying set of attributes.

Why do we have two functions that perform column unioning?

A natural question to ask is why do we have Tables.dictrowtable and Tables.dictcolumntable that perform the same operation?

The reason is simple. Tables.dictrowtable returns a row-oriented object, while Tables.dictcolumntable a column oriented one.

We can easily check it by digging into the internals of these objects:

julia> getfield(Tables.dictrowtable(vnt), :values)
3-element Vector{Dict{Symbol, Any}}:
 Dict(:a => 1, :b => 2)
 Dict(:a => 3, :c => 4)
 Dict(:b => 5, :d => 6)

julia> getfield(Tables.dictcolumntable(vnt), :values)
OrderedCollections.OrderedDict{Symbol, AbstractVector} with 4 entries:
  :a => Union{Missing, Int64}[1, 3, missing]
  :b => Union{Missing, Int64}[2, missing, 5]
  :c => Union{Missing, Int64}[missing, 4, missing]
  :d => Union{Missing, Int64}[missing, missing, 6]

As you can see Tables.dictrowtable internally stores a vector of dictionaries, while Tables.dictcolumntable a dictionary of vectors.

So the question is when you should use which? A simple answer is that most of the time Tables.dictcolumntable is what you want,
unless you want to process data row-by-row and the data is very sparse. The reason is that creating a Dict for each row of the data
has a big overhead. Additionally, most often analytical routines expect data stored in columns. For example DataFrame has a columnar storage.
Therefore, creating a data frame from a Tables.dictcolumntable object will be much faster.

Let us make a small test (I repeat the same @time call several times to capture the variability of the results and make sure all gets compiled):

julia> vnt2 = repeat(vnt, 10^6);

julia> @time DataFrame(Tables.dictrowtable(vnt2));
  9.625842 seconds (99.00 M allocations: 5.018 GiB, 20.14% gc time)

julia> @time DataFrame(Tables.dictrowtable(vnt2));
  9.792331 seconds (99.00 M allocations: 5.018 GiB, 24.41% gc time)

julia> @time DataFrame(Tables.dictcolumntable(vnt2));
  2.029009 seconds (30.00 M allocations: 892.629 MiB)

julia> @time DataFrame(Tables.dictcolumntable(vnt2));
  2.251707 seconds (30.00 M allocations: 892.629 MiB)

Indeed we have a faster execution time, less allocations, and less memory allocated with Tables.dictcolumntable.

If we wanted to squeeze out more performance we could pass copycols=false to DataFrame constructor since we know
that in this case we can safely skip making copies of the vectors representing columns that are passed to the constructor:

julia> @time DataFrame(Tables.dictcolumntable(vnt2), copycols=false);
  2.584373 seconds (30.02 M allocations: 790.861 MiB, 17.83% gc time, 3.40% compilation time)

julia> @time DataFrame(Tables.dictcolumntable(vnt2), copycols=false);
  2.139848 seconds (30.00 M allocations: 789.632 MiB)

julia> @time DataFrame(Tables.dictcolumntable(vnt2), copycols=false);
  1.959573 seconds (30.00 M allocations: 789.632 MiB)

Indeed we get less allocations, but the timing of the operation is only minimally better.

Conclusions

I dedicated my post today only to two functions Tables.dictrowtable and Tables.dictcolumntable.
The reason is that they are not commonly known, and at the same time they are very useful if you
need column unioning behavior when working with your source data. Happy hacking!

DataFrames.jl at the Journal of Statistical Software

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/09/29/dataframes.html

Introduction

This week, I have reached a small personal milestone as
the Journal of Statistical Software has just published
the DataFrames.jl: Flexible and Fast Tabular Data in Julia
paper that I co-authored with Milan Bouchet-Valat.

Therefore, in this post, I thought to summarize various resources
I have been working on over the years that can help you master DataFrames.jl.

Types of documentation

It is hard to document a package properly. The reason is that different users
have different expectations. In this post there is an excellent summary of
four typical kinds of documentation:

  • learning-oriented tutorials;
  • goal-oriented how-to guides;
  • understanding-oriented discussions;
  • information-oriented reference material.

Each kind of documentation requires a slightly different approach, and
having it all in a single place is hard. In the following sections, I will go through
various materials I have prepared over the years and explain my intention
behind them.

The Journal of Statistical Software paper

The objective of the DataFrames.jl: Flexible and Fast Tabular Data in Julia paper
is understanding-oriented. Together with Milan we tried to explain in it how the
DataFrames.jl package was designed, and what were the motivations behind these decisions.

For this reason, you most likely cannot learn how to work with DataFrames.jl after reading
the paper. However, you will get an intuition what are the basic building blocks
of the package.

It is similar to learning languages. I have recently decided to learn French.
As a part of this process, I watched the ALL THE RULES OF FRENCH IN 20 MINUTES video on YouTube.
I did not directly learn French from it, but it helped a lot in understanding the “design” of the French language.

Julia for Data Analysis book

Another major resource I have created is my Julia for Data Analysis book.
This resource is learning-oriented. I start it from the basics of the Julia language
and gradually add more complex elements so that eventually, the reader should be able to:

  • read and write data in various formats;
  • work with tabular data, including subsetting, grouping, and transforming;
  • visualize data;
  • build predictive models;
  • create data processing pipelines;

and more.

As you can probably guess, the DataFrames.jl package is a backbone of this material.
The book was prepared as a textbook that can be used in a 1-semester introductory course on data analysis using Julia
and is accompanied by numerous extra materials that can be found here.

Tutorials

Over the years, I have created a lot of goal-oriented how-to guides.
You can find their list here.

Also, my blog since year 2020 brings you each week some practical information on working with Julia
(quite often DataFrames.jl oriented).

Finally, as a part of DataFrames.jl documentation here and here there are available introductory tutorials to DataFrames.jl.

Since the volume of the tutorials is large, it might be sometimes a bit hard to navigate, but probably this is unavoidable, as there is a substantial variety of questions that users might have.

Reference

I strongly prefer implementing the functionality of DataFrames.jl following the contract specified in the documentation
of provided functions. Therefore, I believe that we have quite a strong collection of reference materials that make it precise
how DataFrames.jl functionality is implemented. It is divided into four major parts:

  • specification of how types exposed by DataFrames.jl are designed is given here;
  • reference on provided functions can be found here;
  • a complete description of how indexing works in DataFrames.jl is available here;
  • information on how data frames handle table and column metadata is given here.

It is essential to highlight that these materials aim to be complete and precise. Unfortunately, this means
that they are verbose and sometimes hard to digest by new users. Unfortunately, I think this cannot be helped,
and that is why we provide other kinds of documentation to make it easier to get started with DataFrames.jl.

Conclusions

I hope that this post can serve DataFrames.jl users as a helpful guide to different resources I have co-authored
that are provided to make learning and using the package easy and fun. Enjoy!