Transforming multiple columns with multiple functions in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/07/09/multicol.html

Introduction

A common question asked in relation to mutation of data frame objects in
DataFrames.jl is how to apply multiple transformation functions to
multiple columns of a data frame. In this post I want to show several possible
approaches to this task.

The codes were run under Julia 1.6.1 and DataFrames.jl 1.2.0.

The basic approach

First create a data frame we will work with:

julia> using DataFrames

julia> df = DataFrame(id=repeat(1:2, 3), c1=1:6, c2=11:16)
6×3 DataFrame
 Row │ id     c1     c2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1     11
   2 │     2      2     12
   3 │     1      3     13
   4 │     2      4     14
   5 │     1      5     15
   6 │     2      6     16

Now assume we want to calculate minimum, maximum and sum of columns
:c1 and :c2. You can do it like this:

julia> combine(df, [:c1, :c2] .=> [minimum maximum sum])
1×6 DataFrame
 Row │ c1_minimum  c2_minimum  c1_maximum  c2_maximum  c1_sum  c2_sum
     │ Int64       Int64       Int64       Int64       Int64   Int64
─────┼────────────────────────────────────────────────────────────────
   1 │          1          11           6          16      21      81

or like this:

julia> combine(df, [:c1 :c2] .=> [minimum, maximum, sum])
1×6 DataFrame
 Row │ c1_minimum  c1_maximum  c1_sum  c2_minimum  c2_maximum  c2_sum
     │ Int64       Int64       Int64   Int64       Int64       Int64
─────┼────────────────────────────────────────────────────────────────
   1 │          1           6      21          11          16      81

depending on what order of output columns you prefer.

Let us try to understand what is going on here. The combine function accepts
vectors or matrices of transformation specifications given as pairs using the
=> operator. If we check the internal broadcasting operation we see:

julia> [:c1 :c2] .=> [minimum, maximum, sum]
3×2 Matrix{Pair{Symbol, _A} where _A}:
 :c1=>minimum  :c2=>minimum
 :c1=>maximum  :c2=>maximum
 :c1=>sum      :c2=>sum

julia> [:c1, :c2] .=> [minimum maximum sum]
2×3 Matrix{Pair{Symbol, _A} where _A}:
 :c1=>minimum  :c1=>maximum  :c1=>sum
 :c2=>minimum  :c2=>maximum  :c2=>sum

In both cases we create a matrix of transformations. Note that the crucial
trick here is that we broadcast a vector againsst a 1-row matrix to get a final
matrix. For instance in [:c1 :c2] .=> [minimum, maximum, sum], the [:c1 :c2]
part is a 1-row matrix (notice that a space is separating its elements) and
[minimum, maximum, sum] is a vector (notice , separating its elements).

Exactly the same pattern can be applied to grouped data frames as you can see here:

julia> gdf = groupby(df, :id)
GroupedDataFrame with 2 groups based on key: id
First Group (3 rows): id = 1
 Row │ id     c1     c2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1     11
   2 │     1      3     13
   3 │     1      5     15
⋮
Last Group (3 rows): id = 2
 Row │ id     c1     c2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     2      2     12
   2 │     2      4     14
   3 │     2      6     16

julia> combine(gdf, [:c1, :c2] .=> [minimum maximum sum])
2×7 DataFrame
 Row │ id     c1_minimum  c2_minimum  c1_maximum  c2_maximum  c1_sum  c2_sum
     │ Int64  Int64       Int64       Int64       Int64       Int64   Int64
─────┼───────────────────────────────────────────────────────────────────────
   1 │     1           1          11           5          15       9      39
   2 │     2           2          12           6          16      12      42

julia> combine(gdf, [:c1 :c2] .=> [minimum, maximum, sum])
2×7 DataFrame
 Row │ id     c1_minimum  c1_maximum  c1_sum  c2_minimum  c2_maximum  c2_sum
     │ Int64  Int64       Int64       Int64   Int64       Int64       Int64
─────┼───────────────────────────────────────────────────────────────────────
   1 │     1           1           5       9          11          15      39
   2 │     2           2           6      12          12          16      42

Programmatic specification of column names

Sometimes instead of writing down [:c1, :c2] manually one would want to select
the columns programmatically. Here are some approaches we currently support:

The most general one is by using the names function with an appropriate column
selector:

julia> combine(df, names(df, r"c") .=> [minimum maximum sum])
1×6 DataFrame
 Row │ c1_minimum  c2_minimum  c1_maximum  c2_maximum  c1_sum  c2_sum
     │ Int64       Int64       Int64       Int64       Int64   Int64
─────┼────────────────────────────────────────────────────────────────
   1 │          1          11           6          16      21      81

or

julia> combine(df, names(df, Not(:id)) .=> [minimum maximum sum])
1×6 DataFrame
 Row │ c1_minimum  c2_minimum  c1_maximum  c2_maximum  c1_sum  c2_sum
     │ Int64       Int64       Int64       Int64       Int64   Int64
─────┼────────────────────────────────────────────────────────────────
   1 │          1          11           6          16      21      81

With grouped data frames an option that is often useful is to use the
valuecols funcion that picks all non-grouping columns:

julia> combine(gdf, valuecols(gdf) .=> [minimum maximum sum])
2×7 DataFrame
 Row │ id     c1_minimum  c2_minimum  c1_maximum  c2_maximum  c1_sum  c2_sum
     │ Int64  Int64       Int64       Int64       Int64       Int64   Int64
─────┼───────────────────────────────────────────────────────────────────────
   1 │     1           1          11           5          15       9      39
   2 │     2           2          12           6          16      12      42

As above, in all the cases it is useful to investigate the arguments
passed to the enclosing functions:

julia> names(df, r"c")
2-element Vector{String}:
 "c1"
 "c2"

julia> names(df, Not(:id))
2-element Vector{String}:
 "c1"
 "c2"

julia> valuecols(gdf)
2-element Vector{Symbol}:
 :c1
 :c2

Nested columns approach

Another approach one could use to apply multiple functions to multiple colums
is to create nested columns like this:

julia> combine(df,
               [:c1, :c2] .=> x -> [(min=minimum(x), max=maximum(x), sum=sum(x))],
               renamecols=false)
1×2 DataFrame
 Row │ c1                            c2
     │ NamedTup…                     NamedTup…
─────┼──────────────────────────────────────────────────────────────
   1 │ (min = 1, max = 6, sum = 21)  (min = 11, max = 16, sum = 81)

or

julia> combine(df,
               [:c1, :c2] .=> x -> Ref((min=minimum(x), max=maximum(x), sum=sum(x))),
               renamecols=false)
1×2 DataFrame
 Row │ c1                            c2
     │ NamedTup…                     NamedTup…
─────┼──────────────────────────────────────────────────────────────
   1 │ (min = 1, max = 6, sum = 21)  (min = 11, max = 16, sum = 81)

Notice that the anonymous function we use produces a NamedTuple, so we need
to protect it against being expanded by wrapping it with a vector or Ref object.
This trick similar to what is done in broadcasting in Julia Base. Let me give
a simple example:

julia> [1 2] .=> (a=1, b=2)
ERROR: ArgumentError: broadcasting over dictionaries and `NamedTuple`s is reserved

julia> [1 2] .=> [(a=1, b=2)]
1×2 Matrix{Pair{Int64, NamedTuple{(:a, :b), Tuple{Int64, Int64}}}}:
 1=>(a = 1, b = 2)  2=>(a = 1, b = 2)

julia> [1 2] .=> Ref((a=1, b=2))
1×2 Matrix{Pair{Int64, NamedTuple{(:a, :b), Tuple{Int64, Int64}}}}:
 1=>(a = 1, b = 2)  2=>(a = 1, b = 2)

and as you can see wrapping a NamedTuple in a vector or Ref made broadcasting
use it “as is” (essentially we had to create a 1-element container that was
broadcasted over). As a side note: Ref is usually a technically preferred way
to make such a protection, but vector is easier to type and in most cases it
produces exactly the same result, so I often use it.

Going back to our aggregation case, the nested column is most useful with
grouped data frame:

julia> df_agg = combine(gdf,
                        [:c1, :c2] .=> x->[(min=minimum(x), max=maximum(x), sum=sum(x))],
                        renamecols=false)
2×3 DataFrame
 Row │ id     c1                            c2
     │ Int64  NamedTup…                     NamedTup…
─────┼─────────────────────────────────────────────────────────────────────
   1 │     1  (min = 1, max = 5, sum = 9)   (min = 11, max = 15, sum = 39)
   2 │     2  (min = 2, max = 6, sum = 12)  (min = 12, max = 16, sum = 42)

Finally you might ask how to un-nest the columns. If you want to do it for a
single column you can just write:

julia> select(df_agg, :id, :c1 => AsTable)
2×4 DataFrame
 Row │ id     min    max    sum
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      1      5      9
   2 │     2      2      6     12

However, for multiple columns this will fail:

julia> select(df_agg, :id, [:c1, :c2] .=> AsTable)
ERROR: ArgumentError: Duplicate column name(s) returned: :min, :max, :sum

The problem is that, as you can see, we would try to create columns with names
:min, :max, and :sum twice, which is disallowed.

What you can do instead is to provide explicit column names:

julia> select(df_agg, :id, [:c1, :c2] .=> [c .* ["min", "max", "sum"] for c in ["c1_", "c2_"]])
2×7 DataFrame
 Row │ id     c1_min  c1_max  c1_sum  c2_min  c2_max  c2_sum
     │ Int64  Int64   Int64   Int64   Int64   Int64   Int64
─────┼───────────────────────────────────────────────────────
   1 │     1       1       5       9      11      15      39
   2 │     2       2       6      12      12      16      42

As usual, it is worth to investigate the argument passed to the combine function
to make sure we understand exactly the syntax:

julia> [:c1, :c2] .=> [c .* ["min", "max", "sum"] for c in ["c1_", "c2_"]]
2-element Vector{Pair{Symbol, Vector{String}}}:
 :c1 => ["c1_min", "c1_max", "c1_sum"]
 :c2 => ["c2_min", "c2_max", "c2_sum"]

Conclusions

In conclusion let me highlight one important feature of the syntax using the
=> operator that we have exploited in this post.

The trick is that => is a callable (technically it calls a Pair
constructor). Therefore it is very easy to use it in programmatic scenarios
when one might want to specify column names or applied functions at run time
e.g. by storing them in some variables.