Handling vectors of vectors in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/08/27/nested.html

Introduction

In this post I want to explore a topic that is a corner case of
transformation mini-language in DataFrames.jl.
The issue is about transformations that produce vectors as values.
This question arises when users start to do more advanced operations
so I decided that it deserves a deeper treatment (as in the last post
this time I have chosen a topic following a question from the user I got recently).

This post was written under Julia 1.6.1, Arrow 1.6.2, and DataFrames.jl 1.2.2.

The standard behavior

By default if a transformation operation returns a vector it gets expanded
into multiple rows, as this is something that typically users expect.
Here is a basic example:

julia> using DataFrames

julia> df = DataFrame(x=[1, 1, 2, 3, 3])
5×1 DataFrame
 Row │ x
     │ Int64
─────┼───────
   1 │     1
   2 │     1
   3 │     2
   4 │     3
   5 │     3

julia> combine(df, :x => reverse)
5×1 DataFrame
 Row │ x_reverse
     │ Int64
─────┼───────────
   1 │         3
   2 │         3
   3 │         2
   4 │         1
   5 │         1

julia> combine(df, :x => unique)
3×1 DataFrame
 Row │ x_unique
     │ Int64
─────┼──────────
   1 │        1
   2 │        2
   3 │        3

As you can see in the unique example the number of rows produced is flexible
with combine. This would not be the case for select or transform which
require the number of rows in the result to match the source, so we get:

julia> select(df, :x => reverse)
5×1 DataFrame
 Row │ x_reverse
     │ Int64
─────┼───────────
   1 │         3
   2 │         3
   3 │         2
   4 │         1
   5 │         1

julia> select(df, :x => unique)
ERROR: ArgumentError: length 3 of vector returned from function unique is different from number of rows 5 of the source data frame.

Similar rules apply when split-apply-combine strategy is used:

julia> gdf = groupby(DataFrame(id=[1, 1, 1, 2, 2, 2], x=[1, 1, 2, 1, 3, 3]), :id)
GroupedDataFrame with 2 groups based on key: id
First Group (3 rows): id = 1
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      1
   3 │     1      2
⋮
Last Group (3 rows): id = 2
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     2      1
   2 │     2      3
   3 │     2      3

julia> combine(gdf, :x => reverse)
6×2 DataFrame
 Row │ id     x_reverse
     │ Int64  Int64
─────┼──────────────────
   1 │     1          2
   2 │     1          1
   3 │     1          1
   4 │     2          3
   5 │     2          3
   6 │     2          1

julia> combine(gdf, :x => unique)
4×2 DataFrame
 Row │ id     x_unique
     │ Int64  Int64
─────┼─────────────────
   1 │     1         1
   2 │     1         2
   3 │     2         1
   4 │     2         3

julia> select(gdf, :x => reverse)
6×2 DataFrame
 Row │ id     x_reverse
     │ Int64  Int64
─────┼──────────────────
   1 │     1          2
   2 │     1          1
   3 │     1          1
   4 │     2          3
   5 │     2          3
   6 │     2          1

julia> select(gdf, :x => unique)
ERROR: ArgumentError: all functions must return vectors with as many values as rows in each group

Putting a vector into a single row

Sometimes one wants to put a vector into a single row of the resulting data frame.
In such case the recommended way to achieve the desired result is to wrap the result
in Ref, just like in broadcasting:

julia> combine(df, :x => Ref∘unique)
1×1 DataFrame
 Row │ x_Ref_unique
     │ Array…
─────┼──────────────
   1 │ [1, 2, 3]

julia> combine(gdf, :x => Ref∘unique)
2×2 DataFrame
 Row │ id     x_Ref_unique
     │ Int64  Array…
─────┼─────────────────────
   1 │     1  [1, 2]
   2 │     2  [1, 3]

This pattern is typically most useful when working with grouped data frames.

Here it is worth to mention that this wrapping is not required if we are
performing a ByRow operation, as ByRow automatically wraps everything our
transformation function produces in an additional vector as a container.

Here is an example:

julia> select(df, :x => ByRow(x -> fill(x, x)))
5×1 DataFrame
 Row │ x_function
     │ Array…
─────┼────────────
   1 │ [1]
   2 │ [1]
   3 │ [2, 2]
   4 │ [3, 3, 3]
   5 │ [3, 3, 3]

If you wanted to expand this result into multiple rows I recommend using the
flatten function in the post-processing, as I discuss it in this post:

julia> flatten(select(df, :x => ByRow(x -> fill(x, x))), 1)
10×1 DataFrame
 Row │ x_function
     │ Int64
─────┼────────────
   1 │          1
   2 │          1
   3 │          2
   4 │          2
   5 │          3
   6 │          3
   7 │          3
   8 │          3
   9 │          3
  10 │          3

Producing a table as an output from a transformation

The patterns above are standard. However, as users get more advanced they
start doing complex transformations that produce multiple columns in their code.

Here is a simple example to start with:

julia> select(df, :x => ByRow(x -> (a=x, b=fill(x, x))) => AsTable)
5×2 DataFrame
 Row │ a      b
     │ Int64  Array…
─────┼──────────────────
   1 │     1  [1]
   2 │     1  [1]
   3 │     2  [2, 2]
   4 │     3  [3, 3, 3]
   5 │     3  [3, 3, 3]

This worked nicely because ByRow(x -> (a=x, b=fill(x, x))) produced a vector
of NamedTuples that was cleanly handled by AsTable.

However, the following fails when aggregating GroupedDataFrame:

julia> combine(gdf, :x => (x -> (a=sum(x), b=x)) => AsTable)
ERROR: ArgumentError: mixing single values and vectors in a named tuple is not allowed

The problem is that the NamedTuple the transformation produces mixes a scalar
(in the column :a) and a vector (in the column :b). This is disallowed as it
would be not clear if the user wants the scalar value of :a to be broadcasted or
the vector stored in :b to be put into a single row of output.

Here are the ways to achieve both behaviors. If you want to broadcast :a into
multiple rows to match the length of :b the simplest approach is to use
a DataFrame instead of a NamedTuple:

julia> combine(gdf, :x => (x -> DataFrame(a=sum(x), b=x)) => AsTable)
6×3 DataFrame
 Row │ id     a      b
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      4      1
   2 │     1      4      1
   3 │     1      4      2
   4 │     2      7      1
   5 │     2      7      3
   6 │     2      7      3

Here the trick is that the DataFrame constructor performs pseudo-broadcasting
automatically.

On the other hand if you want the vector stored in :b to be put into a single row
do one of the following:

julia> combine(gdf, :x => (x -> (a=[sum(x)], b=[x])) => AsTable)
2×3 DataFrame
 Row │ id     a      b
     │ Int64  Int64  SubArray…
─────┼─────────────────────────
   1 │     1      4  [1, 1, 2]
   2 │     2      7  [1, 3, 3]

julia> combine(gdf, :x => (x -> DataFrame(a=sum(x), b=Ref(x))) => AsTable)
2×3 DataFrame
 Row │ id     a      b
     │ Int64  Int64  SubArray…
─────┼─────────────────────────
   1 │     1      4  [1, 1, 2]
   2 │     2      7  [1, 3, 3]

In the first approach we wrapped both :a and :b in a vector, and in the second
we used the pseudo-broadcasting supported by the DataFrame constructor again.

Why using nested vectors might be beneficial?

There are two kinds of benefits. One is readability. The other is performance.

Regarding the readability. Consider you have a homogeneous set of values. Then
you might prefer to store them in a one column to keep them together.
Here is an example of such data:

julia> df1 = DataFrame(x = [fill(i, 1000) for i in 1:10000])
10000×1 DataFrame
   Row │ x
       │ Array…
───────┼───────────────────────────────────
     1 │ [1, 1, 1, 1, 1, 1, 1, 1, 1, 1  ……
     2 │ [2, 2, 2, 2, 2, 2, 2, 2, 2, 2  ……
     3 │ [3, 3, 3, 3, 3, 3, 3, 3, 3, 3  ……
   ⋮   │                 ⋮
  9999 │ [9999, 9999, 9999, 9999, 9999, 9…
 10000 │ [10000, 10000, 10000, 10000, 100…
                          9995 rows omitted

julia> df2 = select(df1, :x => AsTable)
10000×1000 DataFrame
   Row │ x1     x2     x3     x4     x5     x6     x7     x8  ⋯
       │ Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int ⋯
───────┼───────────────────────────────────────────────────────
     1 │     1      1      1      1      1      1      1      ⋯
     2 │     2      2      2      2      2      2      2
     3 │     3      3      3      3      3      3      3
   ⋮   │   ⋮      ⋮      ⋮      ⋮      ⋮      ⋮      ⋮      ⋮ ⋱
  9999 │  9999   9999   9999   9999   9999   9999   9999   99
 10000 │ 10000  10000  10000  10000  10000  10000  10000  100 ⋯
                              993 columns and 9995 rows omitted

Of course it is subjective, but provided that the nested data makes sense to be
kept together (e.g. it could be a time series), I would prefer to use df1 than
df2. Clearly, in this case one could argue that one could consider using narrow
rather than wide
table format, but in practical cases we would typically
have many additional columns with e.g. metadata that would have a constant value
for the whole series, and then I often find it more convenient not to use narrow
format.

Now let us handle the performance issue. Assume we would want to transform the
data by summing it. Here are the timings of operations in both cases
(the timings are for a second run of each operation):

julia> @time select(df1, :x => ByRow(sum));
  0.006237 seconds (102 allocations: 83.953 KiB)

julia> @time select(df2, x -> sum.(eachrow(x)));
  1.829907 seconds (68.27 M allocations: 1.167 GiB, 4.75% gc time, 1.30% compilation time)

Note that in the case of df2 one could consider writing something like
select(df2, names(df2, r"x") => ByRow(+)) to make the operation type-stable,
but the performance of this will be very bad. In this discussion you
can read about the future plans of improving simple row aggregations (like
sum here), but I could have used some more complex transformation operation
which even after that changes would be much faster on nested column.

In short – nested column, assuming its eltype is concrete, solves the tension
between type instability of DataFrame vs potentially extremely long compilation
times when you want to switch to a type stable mode via e.g. a Tuple or
a NamedTuple.

Storage of data frames with nested columns

One drawback of nested columns is that they cannot be easily persistently stored
in CSV files. However, they are easy enough to work with using Arrow.jl:

julia> Arrow.write("test.arrow", df1)
"test.arrow"

julia> df1_read = DataFrame(Arrow.Table("test.arrow"));

julia> df1_read == df1
true

Here, you just need to remember that df1_read created this way is read-only
(at the benefit of performance). You would need to write
DataFrame(Arrow.Table("test.arrow"), copycols=true) to get mutable columns.

Conclusions

Vectors of vectors are not used very commonly in the other data processing
ecosystems. However, DataFrames.jl is designed to allow easy processing
of such data and I find them convenient surprisingly often.