How DataFrames.jl helps fighting piracy

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/07/22/regex.html

Introduction

During JuliaCon 2022 I gave a tutorial on DataFrames.jl.
You can find its recording on YouTube and all source code on GitHub.

This post is a follow-up to one of the questions that I got during
the workshop. The topic of the discussion was applying the same function to
many columns of a data frame. Since the question is quite technical I will
first give you a brief introduction to the topic and next dive deep into the
issue.

I hope that this will be useful material even for people who do not use
DataFrames.jl, as we will explore the consequences of the
"avoid type piracy" rule that the Julia Manual recommends.

The post was written under Julia 1.7.2, DataFrames.jl 1.3.4.

How can one apply a function to multiple columns in DataFrames.jl?

Let us create a sample data frame first:

julia> using DataFrames

julia> df = DataFrame(a1=1:2, b1=3:4, a2=5:6)
2×3 DataFrame
 Row │ a1     b1     a2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      3      5
   2 │     2      4      6

Next I define a simple function that allows us to inspect what arguments
it received:

julia> inspect(x...) = Ref(x)
inspect (generic function with 1 method)

In this function I wrap the tuple of arguments in Ref because in DataFrames.jl
Ref protects the wrapped value against being expanded.
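
To see what this protection means in practice, here is a minimal sketch (the anonymous functions and the :out column name are only illustrative assumptions): a vector returned from a transformation is expanded by combine into multiple rows, while the same vector wrapped in Ref is stored as a single value.

```julia
using DataFrames

df = DataFrame(a1=1:2, b1=3:4, a2=5:6)

# A returned vector is expanded: one row per element.
expanded = combine(df, :a1 => (x -> x) => :out)

# Wrapping the vector in Ref keeps it as a single value in one row.
protected = combine(df, :a1 => (x -> Ref(x)) => :out)

(nrow(expanded), nrow(protected))
```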

Let us check this function in action on some simple examples of the combine
transformation (if you do not know the operation specification syntax, please
check my DataFrames.jl tutorial linked above for an introduction):

julia> combine(df,
               :a1 => inspect,
               r"a" => inspect,
               Cols(endswith("1")) => inspect)
1×3 DataFrame
 Row │ a1_inspect  a1_a2_inspect     a1_b1_inspect
     │ Tuple…      Tuple…            Tuple…
─────┼────────────────────────────────────────────────
   1 │ ([1, 2],)   ([1, 2], [5, 6])  ([1, 2], [3, 4])

julia> combine(df,
               AsTable(:a1) => inspect,
               AsTable(r"a") => inspect,
               AsTable(Cols(endswith("1"))) => inspect)
1×3 DataFrame
 Row │ a1_inspect       a1_a2_inspect              a1_b1_inspect
     │ Tuple…           Tuple…                     Tuple…
─────┼──────────────────────────────────────────────────────────────────────
   1 │ ((a1=[1, 2],),)  ((a1=[1, 2], a2=[5, 6]),)  ((a1=[1, 2], b1=[3, 4]),)

What we can see in these examples is the following:

  • r"a" picks all columns whose names match this regular expression
    (in this case contain "a");
  • Cols(endswith("1")) picks all columns whose names meet the endswith("1")
    predicate (that is, end with "1");
  • by default the selected columns are passed as multiple positional arguments to
    the executed function;
  • if you wrap the selected columns in AsTable then they get passed as a single
    positional argument in a NamedTuple.

It is crucially important to understand at this point that r"a" and
Cols(endswith("1")) column selectors do not get resolved before being passed
to combine:

julia> r"a" => inspect
r"a" => inspect

julia> Cols(endswith("1")) => inspect
Cols{Tuple{Base.Fix2{typeof(endswith), String}}}((Base.Fix2{typeof(endswith), String}(endswith, "1"),)) => inspect

What I mean by resolved is that the expression gets its meaning (i.e. it is
determined which columns it actually selects) inside the combine function, in
the context of the data frame that is passed as the first argument to combine.
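
A small sketch of this point, assuming a second, made-up data frame df2 for illustration: the very same selector value picks different columns depending on the data frame it is eventually applied to.

```julia
using DataFrames

df = DataFrame(a1=1:2, b1=3:4, a2=5:6)
df2 = DataFrame(b=1:2, xa=3:4)  # hypothetical second data frame

sel = r"a"  # just a Regex value; it knows nothing about any data frame yet

# The selector only gets its meaning once a data frame is supplied.
cols_df = names(select(df, sel))
cols_df2 = names(select(df2, sel))
```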

Having seen these basic examples, let us check how one can apply a given
function to multiple columns individually. You can do it e.g. like this:

julia> combine(df, :a1 => inspect, :a2 => inspect)
1×2 DataFrame
 Row │ a1_inspect  a2_inspect
     │ Tuple…      Tuple…
─────┼────────────────────────
   1 │ ([1, 2],)   ([5, 6],)

but there is an easier way. You can do it like this:

julia> combine(df, [:a1, :a2] .=> inspect)
1×2 DataFrame
 Row │ a1_inspect  a2_inspect
     │ Tuple…      Tuple…
─────┼────────────────────────
   1 │ ([1, 2],)   ([5, 6],)

The point is that combine (and other functions in DataFrames.jl) accepts
vectors and matrices of operation specification syntax expressions, and above
I create such a vector using broadcasting with .=>.

Let us check:

julia> [:a1, :a2] .=> inspect
2-element Vector{Pair{Symbol, typeof(inspect)}}:
 :a1 => inspect
 :a2 => inspect

julia> combine(df, [:a1 => inspect, :a2 => inspect])
1×2 DataFrame
 Row │ a1_inspect  a2_inspect
     │ Tuple…      Tuple…
─────┼────────────────────────
   1 │ ([1, 2],)   ([5, 6],)
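
The matrix case mentioned above can be sketched in the same way. Here I assume sum and maximum as example functions (they are not used elsewhere in this post): broadcasting a vector of column names against a one-row matrix of functions produces a 2×2 matrix of pairs, and combine applies every one of them.

```julia
using DataFrames

df = DataFrame(a1=1:2, b1=3:4, a2=5:6)

# [sum maximum] is a 1×2 matrix, so .=> produces a 2×2 matrix of pairs.
specs = [:a1, :a2] .=> [sum maximum]

# combine accepts the whole matrix and produces one column per pair.
res = combine(df, specs)
```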

With the information above, we are now ready to face the pirates.

Using column selectors to pick columns to which we apply the function

We have seen above that r"a" and Cols(endswith("1")) do not get resolved
before they get passed to combine. So how can we apply the inspect
function to all columns selected by them?

The basic approach is to use the names function like this:

julia> names(df, r"a")
2-element Vector{String}:
 "a1"
 "a2"

julia> names(df, r"a") .=> inspect
2-element Vector{Pair{String, typeof(inspect)}}:
 "a1" => inspect
 "a2" => inspect

julia> combine(df, names(df, r"a") .=> inspect)
1×2 DataFrame
 Row │ a1_inspect  a2_inspect
     │ Tuple…      Tuple…
─────┼────────────────────────
   1 │ ([1, 2],)   ([5, 6],)

This method works, but it is a bit heavy-handed, as it requires us to call
the names function, which needs df as its first argument. This duplication
of information is not optimal.

Therefore we are tempted to skip the names(df, r"a") and just write
r"a" .=> inspect. Let us see what it gives us:

julia> combine(df, r"a" .=> inspect)
1×1 DataFrame
 Row │ a1_a2_inspect
     │ Tuple…
─────┼──────────────────
   1 │ ([1, 2], [5, 6])

julia> r"a" .=> inspect
r"a" => inspect

Unfortunately this is not what we expected. The reason is that, as you can see,
r"a" is treated by broadcasting as a scalar. Here we need to note that
the r"a" .=> inspect expression is evaluated before its value is
passed to combine, so Julia cannot evaluate it
in the context of the df data frame.
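
The scalar treatment can be checked in Base Julia alone, without DataFrames.jl. The sketch below assumes a Julia version like the one used in this post, in which Base defines broadcasting of a Regex by wrapping it in a Ref; sum is used only as an example function.

```julia
# Base wraps a Regex in a Ref, so broadcasting sees it as a single value.
wrapped = Base.Broadcast.broadcastable(r"a")

# Hence .=> builds one Pair instead of a collection of pairs.
spec = r"a" .=> sum

# The same scalar behavior is what makes this common idiom work.
matches = occursin.(r"a", ["a1", "b1", "a2"])
```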

Let us check what happens if we use the same approach with
Cols(endswith("1")) .=> inspect:

julia> combine(df, Cols(endswith("1")) .=> inspect)
1×2 DataFrame
 Row │ a1_inspect  b1_inspect
     │ Tuple…      Tuple…
─────┼────────────────────────
   1 │ ([1, 2],)   ([3, 4],)

This time we got what we wanted. It seems that this expression must have been
resolved only after it got passed to combine. Let us inspect it:

julia> Cols(endswith("1")) .=> inspect
DataAPI.BroadcastedSelector{Cols{Tuple{Base.Fix2{typeof(endswith), String}}}}(Cols{Tuple{Base.Fix2{typeof(endswith), String}}}((Base.Fix2{typeof(endswith), String}(endswith, "1"),))) => inspect

We see a strange-looking DataAPI.BroadcastedSelector type. Its role is exactly to
delay the final resolution of the broadcasted operation until the expression
is processed in combine. When combine sees a value of type
DataAPI.BroadcastedSelector it does its post-processing in the context
of the df data frame to give us the desired result.
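
One consequence is worth sketching. Assuming the package versions used in this post, where Cols accepts a regular expression as its argument, wrapping the regex in Cols should give it the delayed broadcasting behavior, because the value being broadcast is then a Cols and not a bare Regex.

```julia
using DataFrames

df = DataFrame(a1=1:2, b1=3:4, a2=5:6)
inspect(x...) = Ref(x)

# Cols(r"a") is a Cols, so .=> produces a delayed, per-column specification.
res = combine(df, Cols(r"a") .=> inspect)
```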

So how does this relate to type piracy? The answer is:

  • in DataFrames.jl we could give Cols(endswith("1")) .=> inspect
    a delayed broadcasting behavior because the Cols selector
    is defined in DataAPI.jl, so we can customize how it is handled in broadcasting;
  • we could not do the same for r"a", because the Regex type is defined in
    Base Julia, and it already has a defined broadcasting behavior. Packages like
    DataFrames.jl should not change how r"a" is handled by broadcasting because
    it would be type piracy.
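
The distinction can be illustrated with a minimal, self-contained sketch in Base Julia. All names below (PrefixSelector, Delayed, resolve) are hypothetical and invented for this illustration; this is not the actual DataAPI.jl or DataFrames.jl implementation, only the same idea in miniature.

```julia
# Hypothetical selector type that we own (it plays the role of Cols).
struct PrefixSelector
    prefix::String
end

# Hypothetical marker meaning "apply the operation to each selected column".
struct Delayed{T}
    sel::T
end

# Not type piracy: PrefixSelector is our own type, so we may decide how it broadcasts.
Base.Broadcast.broadcastable(s::PrefixSelector) = Ref(Delayed(s))

# Type piracy (deliberately left commented out): neither the function nor the
# Regex type belongs to us, so this would change behavior for all Julia code.
# Base.Broadcast.broadcastable(r::Regex) = Ref(Delayed(r))

# Broadcasting does not know the column names yet, so it only records the marker.
spec = PrefixSelector("a") .=> sum

# Resolution happens later, once the column names are known.
resolve(p::Pair{<:Delayed}, colnames) =
    [n => last(p) for n in colnames if startswith(n, first(p).sel.prefix)]

resolved = resolve(spec, ["a1", "b1", "a2"])
```

Here resolve stands in for the post-processing that combine performs once it knows the data frame.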

Conclusions

So what do we learn from today’s post?

  • for DataFrames.jl users: remember that using a regular expression as a column
    selector, while convenient, does not have the same nice broadcasting support
    that other selectors, like Cols, Not, Between, and All, have;
  • for a general audience: Julia is really flexible and allows packages to
    customize almost everything (in our case how broadcasting works); the only
    limitation is that such customization should be done on the types that you
    define yourself; changing how functions you do not own behave for types
    defined in Base or in other packages is called type piracy and is not
    recommended.

If you want to learn what exactly happens when you pass
Cols(endswith("1")) .=> inspect to combine
you can check it here and here.