Re-posted from: https://bkamins.github.io/julialang/2022/02/25/anyall.html
Introduction
Recently on Stack Overflow there was an interesting question about
row-wise aggregation of boolean columns in DataFrames.jl.
I think that the solution is an interesting exercise in functionalities of
DataFrames.jl.
The post was written under Julia 1.7 and DataFrames.jl 1.3.2.
The question
Assume we have the following data frame:
julia> using DataFrames, Random
julia> Random.seed!(1234);
julia> df = DataFrame(rand(Bool, 10, 3), :auto)
10×3 DataFrame
Row │ x1 x2 x3
│ Bool Bool Bool
─────┼─────────────────────
1 │ true false true
2 │ false false true
3 │ false false false
4 │ true false false
5 │ true true false
6 │ true true true
7 │ true false true
8 │ true false false
9 │ false false false
10 │ true false true
We now would like to apply all
and any
functions to this data frame
row-wise. The basic approach to this task would be to do the following solution:
julia> transform(df, AsTable(r"x") .=> ByRow.([all, any]))
10×5 DataFrame
Row │ x1 x2 x3 x1_x2_x3_all x1_x2_x3_any
│ Bool Bool Bool Bool Bool
─────┼─────────────────────────────────────────────────
1 │ true false true false true
2 │ false false true false true
3 │ false false false false false
4 │ true false false false true
5 │ true true false false true
6 │ true true true true true
7 │ true false true false true
8 │ true false false false true
9 │ false false false false false
10 │ true false true false true
The problem with this approach is that it does not scale well when you have
many columns in your data frame:
julia> df2 = DataFrame(rand(Bool, 10_000, 1000), :auto);
julia> @time select(df2, AsTable(r"x") .=> ByRow.([all, any]));
12.026759 seconds (1.13 M allocations: 730.530 MiB, 0.39% gc time, 42.79% compilation time)
julia> @time select(df2, AsTable(r"x") .=> ByRow.([all, any]));
6.746910 seconds (145.42 k allocations: 678.763 MiB, 0.29% gc time)
As you can see it has both high compilation cost and high execution time.
Speeding up the computation
The basic way to resolve this issue is to use the composition with collect
, as
the DataFrames.table_transformation
documentation suggests us. In this way we
avoid constructing a very wide NamedTuple
, which is problematic in this
case.
julia> @time select(df2, AsTable(r"x") .=> ByRow.([all, any] .∘ collect));
0.616105 seconds (893.18 k allocations: 69.745 MiB, 3.26% gc time, 80.07% compilation time)
julia> @time select(df2, AsTable(r"x") .=> ByRow.([all, any] .∘ collect));
0.129053 seconds (26.39 k allocations: 21.293 MiB, 6.65% gc time)
This is much better. However, we can further notice that any
function is
essentially the same as maximum
, and all
the same as minimum
for boolean
data. In this case we do not have to use the collect
composition:
julia> @time select(df2, AsTable(r"x") .=> ByRow.([minimum, maximum]));
0.393608 seconds (783.29 k allocations: 42.571 MiB, 2.45% gc time, 96.49% compilation time)
julia> @time select(df2, AsTable(r"x") .=> ByRow.([minimum, maximum]));
0.002593 seconds (5.37 k allocations: 503.359 KiB)
This time things are very fast, as row-wise aggregation for maximum
and minimum
is optimized.
Conclusions
There are the following lessons from the examples we have given today:
- when performing row-wise aggregation the default
AsTable
timings can be bad,
because they construct an intermediateNamedTuple
, which is expensive
(of course sometimes it is useful, if you want to refer to column names in
transformations – that is why this form is supported); - to avoid construction of
NamedTuple
use composition withcollect
, which
uses a faster path that passes a vector to the aggregation function directly; - an even faster option is available for selected list of most common aggregation
functions, which includes:sum
,length
,mean
,var
,std
,median
,
minimum
, andmaximum
(all potentially composed withskipmissing
).