Re-posted from: https://bkamins.github.io/julialang/2024/02/02/combine.html
Introduction
I have written in the past about DataFrames.jl operation specification syntax
(also called minilanguage), see for example this post or this post.
Today I want to discuss one design decision made in this minilanguage and its consequences.
It is related with how vectors are handled when they are returned from some transformation function.
The post was written under Julia 1.10.0 and DataFrames.jl 1.6.1.
A basic example
Consider the following example, where we want to compute a profit from some sales data:
julia> using DataFrames
julia> df = DataFrame(name=["A", "B", "C"],
revenue=[10, 20, 30],
cost=[5, 12, 18])
3×3 DataFrame
Row │ name revenue cost
│ String Int64 Int64
─────┼────────────────────────
1 │ A 10 5
2 │ B 20 12
3 │ C 30 18
julia> combine(df, All(), ["revenue", "cost"] => (-) => "profit")
3×4 DataFrame
Row │ name revenue cost profit
│ String Int64 Int64 Int64
─────┼────────────────────────────────
1 │ A 10 5 5
2 │ B 20 12 8
3 │ C 30 18 12
The crucial point to understand here is that the -
function takes
two columns "revenue"
and "cost"
and returns a vector.
Users typically expect, as in this example, that this vector should
be spread across several rows.
When vector spreading is not desirable?
However, there are cases, when we might not want to spread a vector
into multiple rows. Consider for example a transformation in which
we want to put "revenue"
and "profit"
values in a 2-element vector
per product. Intuitively we could write something like:
julia> combine(groupby(df, :name),
All(),
["revenue", "cost"] => ((x,y) -> [only(x), only(y)]) => "vec")
ERROR: ArgumentError: all functions must return vectors of the same length
We get an error unfortunately. We will soon understand why, but before
I proceed let me comment on the [only(x), only(y)]
part of the definition.
The only
function makes sure that we have exactly one row per product.
To diagnose the issue let us drop the All()
part in our call:
julia> combine(groupby(df, :name),
["revenue", "cost"] => ((x,y) -> [only(x), only(y)]) => "vec")
6×2 DataFrame
Row │ name vec
│ String Int64
─────┼───────────────
1 │ A 10
2 │ A 5
3 │ B 20
4 │ B 12
5 │ C 30
6 │ C 18
Now we understand the problem. Because our function returns a vector it gets
spread over several rows (which leads to an error as other columns of df
have
a different length).
Solving the vector-spreading issue
As I have said above, most of the time vector spreading is a desired feature,
but in the example we have just studied it is not wanted.
For such cases DataFrames.jl allows you to protect vectors from being spread.
What you need to do is to call Ref
function on the returned value.
This will protect the result from being spread:
julia> combine(groupby(df, :name),
All(),
["revenue", "cost"] => ((x,y) -> Ref([only(x), only(y)])) => "vec")
3×4 DataFrame
Row │ name revenue cost vec
│ String Int64 Int64 Array…
─────┼──────────────────────────────────
1 │ A 10 5 [10, 5]
2 │ B 20 12 [20, 12]
3 │ C 30 18 [30, 18]
Now, as we wanted, the entries of the "vec"
columns are vectors. Wrapping the return
value of our function with Ref
protected the vectors from being spread.
The alternative function that you could use to get the same effect is fill
:
julia> combine(groupby(df, :name),
All(),
["revenue", "cost"] => ((x,y) -> fill([only(x), only(y)])) => "vec")
3×4 DataFrame
Row │ name revenue cost vec
│ String Int64 Int64 Array…
─────┼──────────────────────────────────
1 │ A 10 5 [10, 5]
2 │ B 20 12 [20, 12]
3 │ C 30 18 [30, 18]
or you could wrap the return value with another pair of [...]
:
julia> combine(groupby(df, :name),
All(),
["revenue", "cost"] => ((x,y) -> [[only(x), only(y)]]) => "vec")
3×4 DataFrame
Row │ name revenue cost vec
│ String Int64 Int64 Array…
─────┼──────────────────────────────────
1 │ A 10 5 [10, 5]
2 │ B 20 12 [20, 12]
3 │ C 30 18 [30, 18]
What is going on here? In all three cases (Ref
, fill
, and [...]
) we are wrapping a vector in another object that works like an outer vector.
In the case of [...]
it is just a vector, fill
produces a 0
-dimensional array, and Ref
creates a wrapper that behaves like 0-dimensional array.
In all cases DataFrames.jl treats this outer wrapper as a 1-element vector and just stores its contents in a single row (because there is one element to store).
Conclusions
I hope that you will find the example I gave today useful when transforming vectors using DataFrames.jl.