Getting ready for JuliaCon Local Eindhoven 2023

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/10/13/juliacon.html

Introduction

I am really excited that soon we will have a Julia conference in Europe.
The JuliaCon Local Eindhoven 2023 will take place on December 1, 2023 in Eindhoven.

I thought that as nice preparation for this event it would be useful to share
some tips that can be useful for DataFrames.jl users.
Frequently, when working with tables, one has to process time series data.
I want to share a few examples how such data can be stored in DataFrames.jl.

The post was written under Julia 1.9.2 and DataFrames.jl 1.6.1.

Setting up the stage

Let us start with a data frame that has 4 columns and 1000 rows representing consecutive measurements of some data.
I also add the :period column keeping track of the moment at which the measurement was made.

julia> using DataFrames

julia> using Random

julia> Random.seed!(1234);

julia> df1 = DataFrame(rand(1000, 4), :auto);

julia> df1.period = 1:1000;

julia> df1
1000×5 DataFrame
  Row │ x1         x2         x3        x4         period
      │ Float64    Float64    Float64   Float64    Int64
──────┼───────────────────────────────────────────────────
    1 │ 0.579862   0.473374   0.366288  0.771849        1
    2 │ 0.411294   0.176997   0.652475  0.938484        2
    3 │ 0.972136   0.676856   0.900155  0.390673        3
    4 │ 0.0149088  0.177963   0.374975  0.694063        4
  ⋮   │     ⋮          ⋮         ⋮          ⋮        ⋮
  997 │ 0.424295   0.78012    0.577681  0.274228      997
  998 │ 0.481554   0.0441882  0.700181  0.317677      998
  999 │ 0.401528   0.658903   0.438309  0.211217      999
 1000 │ 0.58985    0.909857   0.354735  0.442829     1000
                                          992 rows omitted

Reshaping data: wide to long

The first operation we want to do is reshaping the data from wide to long format,
in which we will have three columns: the period, the variable name, and the variable value.
To achieve this we can use the stack function:

julia> df2 = stack(df1, Not(:period))
4000×3 DataFrame
  Row │ period  variable  value
      │ Int64   String    Float64
──────┼─────────────────────────────
    1 │      1  x1        0.579862
    2 │      2  x1        0.411294
    3 │      3  x1        0.972136
    4 │      4  x1        0.0149088
  ⋮   │   ⋮        ⋮          ⋮
 3997 │    997  x4        0.274228
 3998 │    998  x4        0.317677
 3999 │    999  x4        0.211217
 4000 │   1000  x4        0.442829
                   3992 rows omitted

Reshaping data: long to wide

Now let us perform the reverse operation, but this time put :period as columns.
We can do it using the unstack function:

julia> df3 = unstack(df2, :variable, :period, :value)
4×1001 DataFrame
 Row │ variable  1         2         3         4          5         6        ⋯
     │ String    Float64?  Float64?  Float64?  Float64?   Float64?  Float64? ⋯
─────┼────────────────────────────────────────────────────────────────────────
   1 │ x1        0.579862  0.411294  0.972136  0.0149088  0.520355  0.639562 ⋯
   2 │ x2        0.473374  0.176997  0.676856  0.177963   0.670122  0.042361
   3 │ x3        0.366288  0.652475  0.900155  0.374975   0.387857  0.779594
   4 │ x4        0.771849  0.938484  0.390673  0.694063   0.724383  0.130453
                                                           995 columns omitted

Nesting columns

Another way to reshape this data is to put all periods in a single cell of a data frame.
We can do it in two ways.

The first is to keep them in columns :x1, :x2, :x3, and :x4. Here are two ways
how you can do it starting from df1 or df2:

julia> combine(df1, Not(:period) .=> Ref, renamecols=false)
1×4 DataFrame
 Row │ x1                                 x2                                 ⋯
     │ Array…                             Array…                             ⋯
─────┼────────────────────────────────────────────────────────────────────────
   1 │ [0.579862, 0.411294, 0.972136, 0…  [0.473374, 0.176997, 0.676856, 0…  ⋯
                                                             2 columns omitted
julia> unstack(df2, [], :variable, :value, combine=collect)
1×4 DataFrame
 Row │ x1                                 x2                                 ⋯
     │ Array…?                            Array…?                            ⋯
─────┼────────────────────────────────────────────────────────────────────────
   1 │ [0.579862, 0.411294, 0.972136, 0…  [0.473374, 0.176997, 0.676856, 0…  ⋯
                                                             2 columns omitted

Alternatively, we could want to have two columns, the first with variable names
and the second with the vectors. Again, let me show two ways to do it starting
from df2 and df3 data frames:

julia> df4 = combine(groupby(df2, :variable), :value => Ref, renamecols=false)
4×2 DataFrame
 Row │ variable  value
     │ String    SubArray…
─────┼─────────────────────────────────────────────
   1 │ x1        [0.579862, 0.411294, 0.972136, 0…
   2 │ x2        [0.473374, 0.176997, 0.676856, 0…
   3 │ x3        [0.366288, 0.652475, 0.900155, 0…
   4 │ x4        [0.771849, 0.938484, 0.390673, 0…

julia> combine(df3, :variable, AsTable(Not(:variable)) => ByRow(identity∘collect) => :value)
4×2 DataFrame
 Row │ variable  value
     │ String    Array…
─────┼─────────────────────────────────────────────
   1 │ x1        Union{Missing, Float64}[0.579862…
   2 │ x2        Union{Missing, Float64}[0.473374…
   3 │ x3        Union{Missing, Float64}[0.366288…
   4 │ x4        Union{Missing, Float64}[0.771849…

These examples are slightly more complicated so let me explain them.

In the first one we use Ref to protect a vector against expansion into multiple columns
(note that no copying of data happens here, to perform a copy you should write Ref∘copy).

In the second one the AsTable(Not(:variable)) source column selector produces a NamedTuple.
However, Julia users probably are aware that with 1000 entries such a NamedTuple would take a long
time to compile. For this reason we are using an optimization that DataFrames.jl provides.
If we pass a function of a form some_function∘collect then this compilation step is avoided.
In our case the function is just identity as we are happy with producing a vector (which collect already returns).

Unnesting columns

As a last example let us show how to expand the df4 data frame back into multiple columns.
This is easy. Just write:

julia> select(df4, :variable, :value => AsTable)
4×1001 DataFrame
 Row │ variable  x1        x2        x3        x4         x5        x6       ⋯
     │ String    Float64   Float64   Float64   Float64    Float64   Float64  ⋯
─────┼────────────────────────────────────────────────────────────────────────
   1 │ x1        0.579862  0.411294  0.972136  0.0149088  0.520355  0.639562 ⋯
   2 │ x2        0.473374  0.176997  0.676856  0.177963   0.670122  0.042361
   3 │ x3        0.366288  0.652475  0.900155  0.374975   0.387857  0.779594
   4 │ x4        0.771849  0.938484  0.390673  0.694063   0.724383  0.130453
                                                           995 columns omitted

Since we used AsTable as target column names specifier we got auto-generated column names.

Conclusions

I hope you liked the examples I presented today and learned something from them.

I am sure that if you attend JuliaCon Local Eindhoven 2023 you are going to
see a lot of highly informative and inspirational talks there!