Re-posted from: https://bkamins.github.io/julialang/2022/03/11/unnesting.html
Introduction
Today I want to discuss ways to nest and unnest columns of a data frame.
We say that we nest several columns, when we take them together and turn
into one column, usually containing NamedTuple
s.
Unnesting is a reverse process, we take a column storing e.g. NamedTuple
s,
and create several columns out of it.
The post was written under Julia 1.7.0, DataFrames.jl 1.3.2, and Tables.jl
1.7.0.
Column nesting
Column nesting is relatively simple in DataFrames.jl. You just need to use
ByRow(identity)
transformation on AsTable
source. Here is an example where
we nest all columns from a source data frame:
julia> using DataFrames
julia> df = DataFrame(a=1:3, b=4:6, c=7:9)
3×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 4 7
2 │ 2 5 8
3 │ 3 6 9
julia> transform(df, AsTable(:) => ByRow(identity) => :nested)
3×4 DataFrame
Row │ a b c nested
│ Int64 Int64 Int64 NamedTup…
─────┼────────────────────────────────────────────
1 │ 1 4 7 (a = 1, b = 4, c = 7)
2 │ 2 5 8 (a = 2, b = 5, c = 8)
3 │ 3 6 9 (a = 3, b = 6, c = 9)
This works because AsTable
passes NamedTuple
objects to the function,
so we just need to apply identity
row-wise to get the desired result.
Basic column unnesting
If you want to perform a reverse process things are also relatively simple, you
just pass the nested column name as source and AsTable
as target column name:
julia> df2 = select(df, AsTable(:) => ByRow(identity) => :nested)
3×1 DataFrame
Row │ nested
│ NamedTup…
─────┼───────────────────────
1 │ (a = 1, b = 4, c = 7)
2 │ (a = 2, b = 5, c = 8)
3 │ (a = 3, b = 6, c = 9)
julia> transform(df2, :nested => AsTable)
3×4 DataFrame
Row │ nested a b c
│ NamedTup… Int64 Int64 Int64
─────┼────────────────────────────────────────────
1 │ (a = 1, b = 4, c = 7) 1 4 7
2 │ (a = 2, b = 5, c = 8) 2 5 8
3 │ (a = 3, b = 6, c = 9) 3 6 9
Complex column unnesting
Sometimes you might have a situation where you have a nested column
that has heterogeneous contents (i.e. has different column names in different rows).
In such a scenario basic unnesting pattern does not work as it requires all rows
to have the same schema:
julia> df3 = DataFrame(nested = [(a=1, b=2), (b=3, c=4), (a=5, c=6)])
3×1 DataFrame
Row │ nested
│ NamedTup…
─────┼────────────────
1 │ (a = 1, b = 2)
2 │ (b = 3, c = 4)
3 │ (a = 5, c = 6)
julia> transform(df3, :nested => AsTable)
ERROR: ArgumentError: keys of the returned elements must be identical
If you have such a situation you can use Tables.dictcolumntable
as a
transformation function:
julia> transform(df3, :nested => Tables.dictcolumntable => AsTable)
3×4 DataFrame
Row │ nested a b c
│ NamedTup… Int64? Int64? Int64?
─────┼───────────────────────────────────────────
1 │ (a = 1, b = 2) 1 2 missing
2 │ (b = 3, c = 4) missing 3 4
3 │ (a = 5, c = 6) 5 missing 6
As you can see the Tables.dictcolumntable
has “column unioning” behavior.
When some row does not have a column that is present in other rows it gets
a missing
value instead.
Conclusions
Column nesting and unnesting is needed when you work with data that has
hierarchical structure. A common example of such a scenario is JSON data. I
hope you will find the patterns I have discussed in this post useful in your
work.