Re-posted from: https://bkamins.github.io/julialang/2022/01/28/subset.html
Introduction
Recently on Julia Slack there was a question about using the subset function
to drop whole groups from a GroupedDataFrame in DataFrames.jl.
I thought that this case is indeed tricky enough to be worth a post.
The examples were tested under Julia 1.7.0 and DataFrames.jl 1.3.2.
Standard use cases of the subset function
Let us start with creating some sample data:
julia> using DataFrames
julia> df = DataFrame(id=[1, 1, 1, 1, 2, 2], x=1:6)
6×2 DataFrame
Row │ id x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 1 2
3 │ 1 3
4 │ 1 4
5 │ 2 5
6 │ 2 6
julia> gdf = groupby(df, :id)
GroupedDataFrame with 2 groups based on key: id
First Group (4 rows): id = 1
Row │ id x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 1 2
3 │ 1 3
4 │ 1 4
⋮
Last Group (2 rows): id = 2
Row │ id x
│ Int64 Int64
─────┼──────────────
1 │ 2 5
2 │ 2 6
Assume we want to keep the rows whose value of :x is less than the mean of this
column in df. This can be achieved with:
julia> using Statistics
julia> subset(df, :x => x -> x .< mean(x))
3×2 DataFrame
Row │ id x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 1 2
3 │ 1 3
The same operation can easily be done groupwise. Now we keep the rows whose
value of :x is less than the mean of this column within each group defined by :id:
julia> subset(gdf, :x => x -> x .< mean(x))
3×2 DataFrame
Row │ id x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 1 2
3 │ 2 5
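To see why rows with :x equal to 1, 2, and 5 are kept, it can help to look at the per-group means directly. The following is a minimal sketch that rebuilds the sample data, computes the means with combine, and then repeats the selection by hand; the variable names group_means, with_mean, and by_hand are only illustrative.

```julia
using DataFrames, Statistics

df = DataFrame(id=[1, 1, 1, 1, 2, 2], x=1:6)
gdf = groupby(df, :id)

# Per-group mean of :x: 2.5 for id = 1 and 5.5 for id = 2.
group_means = combine(gdf, :x => mean)

# The same selection done by hand: attach the group mean to every row
# (transform broadcasts the scalar within each group), then keep the rows
# whose :x is strictly below the mean of their own group.
with_mean = transform(gdf, :x => mean => :x_mean)
by_hand = with_mean[with_mean.x .< with_mean.x_mean, [:id, :x]]
```

Here by_hand holds the same three rows as the subset call on gdf.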
The limitation of the subset contract
The subset function requires that the passed condition returns a vector.
Therefore the following operation fails:
julia> subset(df, :x => x -> true)
ERROR: ArgumentError: functions passed to `subset` must return an AbstractVector.
although we might expect that broadcasting would be applied to the result of
the function and all rows would be kept. For reference, select performs such
broadcasting automatically:
julia> select(df, All(), :x => x -> true)
6×3 DataFrame
Row │ id x x_function
│ Int64 Int64 Bool
─────┼──────────────────────────
1 │ 1 1 true
2 │ 1 2 true
3 │ 1 3 true
4 │ 1 4 true
5 │ 2 5 true
6 │ 2 6 true
You might wonder why this restriction was made. Initially we allowed non-vector
return values, but they turned out to be confusing for users, so we disallowed
them.
Let me give an example. A user who wants to keep all rows for which the :id
column is equal to 1 should write:
julia> subset(df, :id => ByRow(==(1)))
4×2 DataFrame
Row │ id x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 1 2
3 │ 1 3
4 │ 1 4
However, it turned out that users frequently forgot to add the ByRow wrapper
and instead wrote:
julia> subset(df, :id => ==(1))
ERROR: ArgumentError: functions passed to `subset` must return an AbstractVector.
Now it throws an error, but if we had not imposed the restriction that a vector
must be returned, we would get the following result (emulated here with fill):
julia> subset(df, :id => x -> fill(x == 1, length(x)))
0×2 DataFrame
as the whole column :id would be compared to 1 and the result of this
comparison is false.
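The difference between the two comparisons can be checked in plain Julia, without DataFrames.jl. The sketch below uses a standalone vector ids with the same values as the :id column; the variable names are only illustrative.

```julia
ids = [1, 1, 1, 1, 2, 2]

# Without broadcasting the vector as a whole is compared to the number 1,
# which yields a single Bool.
whole = ids == 1

# The partially applied form ==(1), as passed to subset without ByRow,
# performs the same whole-vector comparison.
partial = (==(1))(ids)

# With broadcasting the comparison is done element by element,
# which yields one Bool per row.
elementwise = ids .== 1
```

Both whole and partial are false, while elementwise is true for the first four entries only, which is what ByRow(==(1)) computes row by row.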
Dropping whole groups from a GroupedDataFrame
The requirement that the condition must return a vector was added for safety
reasons. However, there is one case when it is a bit problematic.
Assume we want to keep from the gdf GroupedDataFrame all groups for which
the mean of the :x column is less than 3. The problem is that the following
condition fails:
julia> subset(gdf, :x => x -> mean(x) < 3)
ERROR: ArgumentError: functions passed to `subset` must return an AbstractVector.
since comparing the mean of the :x column to 3 produces a scalar Bool value.
The solution is to manually expand the result of the condition to match the
number of rows in the group:
julia> subset(gdf, :x => x -> fill(mean(x) < 3, length(x)))
4×2 DataFrame
Row │ id x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 1 2
3 │ 1 3
4 │ 1 4
This is unfortunately a bit inconvenient.
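If this pattern is needed often, the expansion step can be hidden in a small wrapper. The sketch below defines a hypothetical helper called wholegroup; it is not part of DataFrames.jl, and the name is only an assumption made for this illustration.

```julia
using DataFrames, Statistics

# Hypothetical helper (not part of DataFrames.jl): takes a function that
# returns a single Bool for a whole column and returns a function that
# repeats this Bool once per row, as the subset contract requires.
wholegroup(f) = x -> fill(f(x), length(x))

df = DataFrame(id=[1, 1, 1, 1, 2, 2], x=1:6)
gdf = groupby(df, :id)

# Keep every row of the groups whose mean of :x is less than 3.
kept = subset(gdf, :x => wholegroup(x -> mean(x) < 3))
```

The wrapper keeps the group-level condition readable while still returning a vector, so the safety check in subset remains in force.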
An alternative approach would be to use the filter function, which when applied
to a GroupedDataFrame always works on whole groups:
julia> filter(:x => x -> mean(x) < 3, gdf) |> DataFrame
4×2 DataFrame
Row │ id x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 1 2
3 │ 1 3
4 │ 1 4
(we had to pass the result of filter to the DataFrame constructor, as otherwise
we would get a filtered GroupedDataFrame).
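The filter function applied to a GroupedDataFrame also accepts a predicate that receives each group as a whole, which can be handy when the condition involves more than one column. A minimal sketch, in which the names sdf, kept_groups, and result are only illustrative:

```julia
using DataFrames, Statistics

df = DataFrame(id=[1, 1, 1, 1, 2, 2], x=1:6)
gdf = groupby(df, :id)

# The predicate receives each group as a SubDataFrame and returns one Bool
# that decides whether the whole group is kept.
kept_groups = filter(sdf -> mean(sdf.x) < 3, gdf)

# kept_groups is still a GroupedDataFrame; convert it to get a plain data frame.
result = DataFrame(kept_groups)
```

In this form only the group with id equal to 1 survives, so result holds its four rows.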
Conclusions
The design of subset that I discussed in this post shows one of the challenges
we face when defining APIs in DataFrames.jl. There is often a tension between
developer convenience and safety. In this example, allowing only vectors as
results of conditions in the subset function is safer, since it lets us
catch some common bugs in users' code. The cost is that in some cases
(most notably dropping whole groups from a GroupedDataFrame) it is a bit
inconvenient.