Working with vectors using DataFrames.jl minilanguage

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/02/02/combine.html

Introduction

I have written in the past about DataFrames.jl operation specification syntax
(also called minilanguage), see for example this post or this post.

Today I want to discuss one design decision made in this minilanguage and its consequences.
It is related with how vectors are handled when they are returned from some transformation function.

The post was written under Julia 1.10.0 and DataFrames.jl 1.6.1.

A basic example

Consider the following example, where we want to compute a profit from some sales data:

julia> using DataFrames

julia> df = DataFrame(name=["A", "B", "C"],
                      revenue=[10, 20, 30],
                      cost=[5, 12, 18])
3×3 DataFrame
 Row │ name    revenue  cost
     │ String  Int64    Int64
─────┼────────────────────────
   1 │ A            10      5
   2 │ B            20     12
   3 │ C            30     18

julia> combine(df, All(), ["revenue", "cost"] => (-) => "profit")
3×4 DataFrame
 Row │ name    revenue  cost   profit
     │ String  Int64    Int64  Int64
─────┼────────────────────────────────
   1 │ A            10      5       5
   2 │ B            20     12       8
   3 │ C            30     18      12

The crucial point to understand here is that the - function takes
two columns "revenue" and "cost" and returns a vector.
Users typically expect, as in this example, that this vector should
be spread across several rows.

When vector spreading is not desirable?

However, there are cases, when we might not want to spread a vector
into multiple rows. Consider for example a transformation in which
we want to put "revenue" and "profit" values in a 2-element vector
per product. Intuitively we could write something like:

julia> combine(groupby(df, :name),
               All(),
               ["revenue", "cost"] => ((x,y) -> [only(x), only(y)]) => "vec")
ERROR: ArgumentError: all functions must return vectors of the same length

We get an error unfortunately. We will soon understand why, but before
I proceed let me comment on the [only(x), only(y)] part of the definition.
The only function makes sure that we have exactly one row per product.

To diagnose the issue let us drop the All() part in our call:

julia> combine(groupby(df, :name),
               ["revenue", "cost"] => ((x,y) -> [only(x), only(y)]) => "vec")
6×2 DataFrame
 Row │ name    vec
     │ String  Int64
─────┼───────────────
   1 │ A          10
   2 │ A           5
   3 │ B          20
   4 │ B          12
   5 │ C          30
   6 │ C          18

Now we understand the problem. Because our function returns a vector it gets
spread over several rows (which leads to an error as other columns of df have
a different length).

Solving the vector-spreading issue

As I have said above, most of the time vector spreading is a desired feature,
but in the example we have just studied it is not wanted.
For such cases DataFrames.jl allows you to protect vectors from being spread.
What you need to do is to call Ref function on the returned value.
This will protect the result from being spread:

julia> combine(groupby(df, :name),
               All(),
               ["revenue", "cost"] => ((x,y) -> Ref([only(x), only(y)])) => "vec")
3×4 DataFrame
 Row │ name    revenue  cost   vec
     │ String  Int64    Int64  Array…
─────┼──────────────────────────────────
   1 │ A            10      5  [10, 5]
   2 │ B            20     12  [20, 12]
   3 │ C            30     18  [30, 18]

Now, as we wanted, the entries of the "vec" columns are vectors. Wrapping the return
value of our function with Ref protected the vectors from being spread.

The alternative function that you could use to get the same effect is fill:

julia> combine(groupby(df, :name),
               All(),
               ["revenue", "cost"] => ((x,y) -> fill([only(x), only(y)])) => "vec")
3×4 DataFrame
 Row │ name    revenue  cost   vec
     │ String  Int64    Int64  Array…
─────┼──────────────────────────────────
   1 │ A            10      5  [10, 5]
   2 │ B            20     12  [20, 12]
   3 │ C            30     18  [30, 18]

or you could wrap the return value with another pair of [...]:

julia> combine(groupby(df, :name),
               All(),
               ["revenue", "cost"] => ((x,y) -> [[only(x), only(y)]]) => "vec")
3×4 DataFrame
 Row │ name    revenue  cost   vec
     │ String  Int64    Int64  Array…
─────┼──────────────────────────────────
   1 │ A            10      5  [10, 5]
   2 │ B            20     12  [20, 12]
   3 │ C            30     18  [30, 18]

What is going on here? In all three cases (Ref, fill, and [...]) we are wrapping a vector in another object that works like an outer vector.
In the case of [...] it is just a vector, fill produces a 0-dimensional array, and Ref creates a wrapper that behaves like 0-dimensional array.
In all cases DataFrames.jl treats this outer wrapper as a 1-element vector and just stores its contents in a single row (because there is one element to store).

Conclusions

I hope that you will find the example I gave today useful when transforming vectors using DataFrames.jl.