Nesting and unnesting columns in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/03/11/unnesting.html

Introduction

Today I want to discuss ways to nest and unnest columns of a data frame.

We say that we nest several columns, when we take them together and turn
into one column, usually containing NamedTuples.

Unnesting is a reverse process, we take a column storing e.g. NamedTuples,
and create several columns out of it.

The post was written under Julia 1.7.0, DataFrames.jl 1.3.2, and Tables.jl
1.7.0.

Column nesting

Column nesting is relatively simple in DataFrames.jl. You just need to use
ByRow(identity) transformation on AsTable source. Here is an example where
we nest all columns from a source data frame:

julia> using DataFrames

julia> df = DataFrame(a=1:3, b=4:6, c=7:9)
3×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      4      7
   2 │     2      5      8
   3 │     3      6      9

julia> transform(df, AsTable(:) => ByRow(identity) => :nested)
3×4 DataFrame
 Row │ a      b      c      nested
     │ Int64  Int64  Int64  NamedTup…
─────┼────────────────────────────────────────────
   1 │     1      4      7  (a = 1, b = 4, c = 7)
   2 │     2      5      8  (a = 2, b = 5, c = 8)
   3 │     3      6      9  (a = 3, b = 6, c = 9)

This works because AsTable passes NamedTuple objects to the function,
so we just need to apply identity row-wise to get the desired result.

Basic column unnesting

If you want to perform a reverse process things are also relatively simple, you
just pass the nested column name as source and AsTable as target column name:

julia> df2 = select(df, AsTable(:) => ByRow(identity) => :nested)
3×1 DataFrame
 Row │ nested
     │ NamedTup…
─────┼───────────────────────
   1 │ (a = 1, b = 4, c = 7)
   2 │ (a = 2, b = 5, c = 8)
   3 │ (a = 3, b = 6, c = 9)

julia> transform(df2, :nested => AsTable)
3×4 DataFrame
 Row │ nested                 a      b      c
     │ NamedTup…              Int64  Int64  Int64
─────┼────────────────────────────────────────────
   1 │ (a = 1, b = 4, c = 7)      1      4      7
   2 │ (a = 2, b = 5, c = 8)      2      5      8
   3 │ (a = 3, b = 6, c = 9)      3      6      9

Complex column unnesting

Sometimes you might have a situation where you have a nested column
that has heterogeneous contents (i.e. has different column names in different rows).
In such a scenario basic unnesting pattern does not work as it requires all rows
to have the same schema:

julia> df3 = DataFrame(nested = [(a=1, b=2), (b=3, c=4), (a=5, c=6)])
3×1 DataFrame
 Row │ nested
     │ NamedTup…
─────┼────────────────
   1 │ (a = 1, b = 2)
   2 │ (b = 3, c = 4)
   3 │ (a = 5, c = 6)

julia> transform(df3, :nested => AsTable)
ERROR: ArgumentError: keys of the returned elements must be identical

If you have such a situation you can use Tables.dictcolumntable as a
transformation function:

julia> transform(df3, :nested => Tables.dictcolumntable => AsTable)
3×4 DataFrame
 Row │ nested          a        b        c
     │ NamedTup…       Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │ (a = 1, b = 2)        1        2  missing
   2 │ (b = 3, c = 4)  missing        3        4
   3 │ (a = 5, c = 6)        5  missing        6

As you can see the Tables.dictcolumntable has “column unioning” behavior.
When some row does not have a column that is present in other rows it gets
a missing value instead.

Conclusions

Column nesting and unnesting is needed when you work with data that has
hierarchical structure. A common example of such a scenario is JSON data. I
hope you will find the patterns I have discussed in this post useful in your
work.