Any and all reduction of rows in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/02/25/anyall.html

Introduction

Recently on Stack Overflow there was an interesting question about
row-wise aggregation of boolean columns in DataFrames.jl.

I think that the solution is an interesting exercise in functionalities of
DataFrames.jl.

The post was written under Julia 1.7 and DataFrames.jl 1.3.2.

The question

Assume we have the following data frame:

julia> using DataFrames, Random

julia> Random.seed!(1234);

julia> df = DataFrame(rand(Bool, 10, 3), :auto)
10×3 DataFrame
 Row │ x1     x2     x3
     │ Bool   Bool   Bool
─────┼─────────────────────
   1 │  true  false   true
   2 │ false  false   true
   3 │ false  false  false
   4 │  true  false  false
   5 │  true   true  false
   6 │  true   true   true
   7 │  true  false   true
   8 │  true  false  false
   9 │ false  false  false
  10 │  true  false   true

We now would like to apply all and any functions to this data frame
row-wise. The basic approach to this task would be to do the following solution:

julia> transform(df, AsTable(r"x") .=> ByRow.([all, any]))
10×5 DataFrame
 Row │ x1     x2     x3     x1_x2_x3_all  x1_x2_x3_any
     │ Bool   Bool   Bool   Bool          Bool
─────┼─────────────────────────────────────────────────
   1 │  true  false   true         false          true
   2 │ false  false   true         false          true
   3 │ false  false  false         false         false
   4 │  true  false  false         false          true
   5 │  true   true  false         false          true
   6 │  true   true   true          true          true
   7 │  true  false   true         false          true
   8 │  true  false  false         false          true
   9 │ false  false  false         false         false
  10 │  true  false   true         false          true

The problem with this approach is that it does not scale well when you have
many columns in your data frame:

julia> df2 = DataFrame(rand(Bool, 10_000, 1000), :auto);

julia> @time select(df2, AsTable(r"x") .=> ByRow.([all, any]));
 12.026759 seconds (1.13 M allocations: 730.530 MiB, 0.39% gc time, 42.79% compilation time)

julia> @time select(df2, AsTable(r"x") .=> ByRow.([all, any]));
  6.746910 seconds (145.42 k allocations: 678.763 MiB, 0.29% gc time)

As you can see it has both high compilation cost and high execution time.

Speeding up the computation

The basic way to resolve this issue is to use the composition with collect, as
the DataFrames.table_transformation documentation suggests us. In this way we
avoid constructing a very wide NamedTuple, which is problematic in this
case.

julia> @time select(df2, AsTable(r"x") .=> ByRow.([all, any] .∘ collect));
  0.616105 seconds (893.18 k allocations: 69.745 MiB, 3.26% gc time, 80.07% compilation time)

julia> @time select(df2, AsTable(r"x") .=> ByRow.([all, any] .∘ collect));
  0.129053 seconds (26.39 k allocations: 21.293 MiB, 6.65% gc time)

This is much better. However, we can further notice that any function is
essentially the same as maximum, and all the same as minimum for boolean
data. In this case we do not have to use the collect composition:

julia> @time select(df2, AsTable(r"x") .=> ByRow.([minimum, maximum]));
  0.393608 seconds (783.29 k allocations: 42.571 MiB, 2.45% gc time, 96.49% compilation time)

julia> @time select(df2, AsTable(r"x") .=> ByRow.([minimum, maximum]));
  0.002593 seconds (5.37 k allocations: 503.359 KiB)

This time things are very fast, as row-wise aggregation for maximum
and minimum is optimized.

Conclusions

There are the following lessons from the examples we have given today:

  • when performing row-wise aggregation the default AsTable timings can be bad,
    because they construct an intermediate NamedTuple, which is expensive
    (of course sometimes it is useful, if you want to refer to column names in
    transformations – that is why this form is supported);
  • to avoid construction of NamedTuple use composition with collect, which
    uses a faster path that passes a vector to the aggregation function directly;
  • an even faster option is available for selected list of most common aggregation
    functions, which includes: sum, length, mean, var, std, median,
    minimum, and maximum (all potentially composed with skipmissing).