How is equality checked in DataFrames.jl?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/04/14/isequal.html

Introduction

Today I want to discuss how values are tested for being equal in
functions provided by DataFrames.jl.

I have already discussed equality testing in two earlier posts and
explain it extensively in chapter 7 of my Julia for Data Analysis book.
However, users still raise the issue often, so I thought it would be useful
to go back to it one more time.

The post was written under Julia 1.9.0-rc1, CategoricalArrays.jl 0.10.7 and DataFrames.jl 1.5.0.

Why is equality testing hard?

When users learn Julia they are typically taught that == is the operator
that should be used for testing for equality. Here is a basic example:

julia> 1 == 2
false

julia> 1 == 1
true

However, some aspects of == make it unintuitive
in certain scenarios.

First, == is not guaranteed to return a Bool value. If one of its
arguments is missing, then the result is missing:

julia> 1 == missing
missing

julia> missing == missing
missing

Clearly this is not desirable when we expect the operation to return a Bool (e.g. when filtering data). Other comparison operators, such as >, propagate missing in the same way:

julia> x = [1, 2, missing, 4, 5]
5-element Vector{Union{Missing, Int64}}:
 1
 2
  missing
 4
 5

julia> x[x .> 2.5]
ERROR: ArgumentError: unable to check bounds for indices of type Missing

In such cases you should use the coalesce function to decide if you want to keep or drop missing values:

julia> x[coalesce.(x .> 2.5, false)]
2-element Vector{Union{Missing, Int64}}:
 4
 5

julia> x[coalesce.(x .> 2.5, true)]
3-element Vector{Union{Missing, Int64}}:
  missing
 4
 5
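
If the goal is simply to drop missing values, a predicate that checks ismissing first avoids building a three-valued mask at all. The snippet below is a minimal sketch of that alternative, not the only way to do it; the variable names kept and kept_strict are illustrative and do not come from the examples above.

```julia
x = [1, 2, missing, 4, 5]

# Short-circuiting && guarantees the predicate returns a Bool:
# v > 2.5 is evaluated only when v is not missing.
kept = filter(v -> !ismissing(v) && v > 2.5, x)

# Filtering a skipmissing wrapper drops the missing values first,
# so the result no longer has Missing in its element type.
kept_strict = filter(>(2.5), skipmissing(x))
```

The first form keeps the Union{Missing, Int64} element type of x, while the second returns a plain Vector{Int64}, which may be preferable when later code should not have to handle missing.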

The second issue is that == follows IEEE 754 semantics for floating-point numbers:

julia> NaN == NaN
false

julia> 0.0 == -0.0
true

First, you see that NaN is not considered to be equal to NaN. This can be quite surprising:

julia> x = [1, NaN, 2]
3-element Vector{Float64}:
   1.0
 NaN
   2.0

julia> x[x .== NaN]
Float64[]

Instead, the isnan function can be used:

julia> x[isnan.(x)]
1-element Vector{Float64}:
 NaN
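
One caveat worth sketching: isnan propagates missing, so on a vector that mixes NaN and missing its broadcast result cannot be used directly for indexing. A possible alternative, shown here under the assumption that you want exactly the NaN entries, is to compare with isequal, which always returns a Bool. The vector y is a made-up example.

```julia
y = [1.0, NaN, missing, 2.0]

# isnan(missing) is missing, so this mask is not a pure Bool vector.
nan_mask = isnan.(y)

# isequal always returns a Bool, so this mask is safe for indexing.
picked = y[isequal.(y, NaN)]
```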

The 0.0 and -0.0 case is even trickier. These are technically two distinct values.
However, in some applications users might want them to be treated as equal, while in others
as not equal. The IEEE 754 standard specifies that they are considered equal when compared
using ==.
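
To make "technically distinct" concrete, here is a small sketch using only Base functions. It shows that the two zeros compare equal with == but differ in their sign bit, which becomes visible in operations such as division. The variable names are illustrative.

```julia
pos, neg = 0.0, -0.0

same_by_eq   = pos == neg     # true: IEEE 754 comparison
same_by_egal = pos === neg    # false: the bit patterns differ
neg_has_sign = signbit(neg)   # true: the sign bit is set
inv_pos      = 1 / pos        # Inf
inv_neg      = 1 / neg        # -Inf
```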

In summary, the major problem with == is that it does not define a proper equivalence
relation. First, some values are not comparable (missing is returned); second,
it is not reflexive (NaN == NaN is false).
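
The reflexivity point can be checked directly. The following sketch compares each of a few sample values with itself; the choice of values in vals is mine and only meant to cover the cases discussed above.

```julia
vals = Any[1, 0.0, -0.0, NaN, missing]

# With ==, comparing a value with itself can yield true, false (NaN),
# or missing, so == is not reflexive on this set.
refl_eq = [v == v for v in vals]

# With isequal, every value is equal to itself.
refl_isequal = [isequal(v, v) for v in vals]
```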

An alternative way to compare values

In many cases you need an equality operator that defines an equivalence relation.
In Julia this is provided by the isequal function. As you can read in its documentation:

isequal treats all floating-point NaN values as equal to each other,
treats -0.0 as unequal to 0.0, and missing as equal to missing.
Always returns a Bool value.

Let us check this:

julia> isequal(1, missing)
false

julia> isequal(missing, missing)
true

julia> isequal(NaN, NaN)
true

julia> isequal(0.0, -0.0)
false
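
The same difference shows up when whole containers are compared, because the container methods apply the respective comparison element by element. This is a brief sketch with made-up vectors:

```julia
with_missing = [1, missing]
with_nan = [1.0, NaN]

r1 = with_missing == with_missing          # missing: the result is undetermined
r2 = with_nan == with_nan                  # false: NaN == NaN is false
r3 = isequal(with_missing, with_missing)   # true
r4 = isequal(with_nan, with_nan)           # true
```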

In Julia, functions that partition a collection of values into equivalence classes use
isequal to test for equality. In Base Julia, examples include Dict and Set
operations and the unique function:

julia> Set([0.0, 0.0, -0.0, -0.0, NaN, NaN, missing, missing])
Set{Union{Missing, Float64}} with 4 elements:
  0.0
  NaN
  -0.0
  missing

julia> unique([0.0, 0.0, -0.0, -0.0, NaN, NaN, missing, missing])
4-element Vector{Union{Missing, Float64}}:
   0.0
  -0.0
 NaN
    missing
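
As a further sketch of the same rule, Dict lookups and membership tests on a Set use isequal, whereas the in test on a vector uses ==. The dictionary values below are arbitrary labels chosen for illustration.

```julia
d = Dict(0.0 => "zero", -0.0 => "negative zero",
         NaN => "nan", missing => "missing")

n_keys = length(d)            # 4: all four keys are distinct under isequal
label_nan = d[NaN]            # lookup by NaN works because isequal(NaN, NaN)

in_vec_nan  = NaN in [NaN]          # false: vectors use ==
in_set_nan  = NaN in Set([NaN])     # true: sets use isequal
in_vec_zero = -0.0 in [0.0]         # true: 0.0 == -0.0
in_set_zero = -0.0 in Set([0.0])    # false: isequal(0.0, -0.0) is false
```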

The same rules carry over to DataFrames.jl.

Testing for equality in DataFrames.jl

There are the following functionalities of DataFrames.jl that rely on the isequal equality test:

  • deduplication with unique and related functions;
  • grouping with groupby;
  • joins (innerjoin etc.).

Let us see them in action one by one. We start with the deduplication:

julia> using DataFrames

julia> df = DataFrame(id=1:8, x=[0.0, 0.0, -0.0, -0.0, NaN, NaN, missing, missing])
8×2 DataFrame
 Row │ id     x
     │ Int64  Float64?
─────┼──────────────────
   1 │     1        0.0
   2 │     2        0.0
   3 │     3       -0.0
   4 │     4       -0.0
   5 │     5      NaN
   6 │     6      NaN
   7 │     7  missing
   8 │     8  missing

julia> unique(df, :x)
4×2 DataFrame
 Row │ id     x
     │ Int64  Float64?
─────┼──────────────────
   1 │     1        0.0
   2 │     3       -0.0
   3 │     5      NaN
   4 │     7  missing

Indeed, we see that 0.0 and -0.0 are treated as distinct values,
while the duplicated NaN and missing entries are each reduced to a single row.
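
If you want to see which rows are treated as duplicates rather than the deduplicated table, the nonunique function returns a Bool mask. This is a minimal sketch that rebuilds the same df so it can be run on its own; dup_mask and first_ids are illustrative names.

```julia
using DataFrames

df = DataFrame(id=1:8, x=[0.0, 0.0, -0.0, -0.0, NaN, NaN, missing, missing])

# true marks a row whose :x value already appeared in an earlier row.
dup_mask = nonunique(df, :x)

# The ids of the rows that unique(df, :x) keeps.
first_ids = df.id[.!dup_mask]
```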

Now let us turn to grouping:

julia> show(groupby(df, :x), allgroups=true)
GroupedDataFrame with 4 groups based on key: x
Group 1 (2 rows): x = 0.0
 Row │ id     x
     │ Int64  Float64?
─────┼─────────────────
   1 │     1       0.0
   2 │     2       0.0
Group 2 (2 rows): x = -0.0
 Row │ id     x
     │ Int64  Float64?
─────┼─────────────────
   1 │     3      -0.0
   2 │     4      -0.0
Group 3 (2 rows): x = NaN
 Row │ id     x
     │ Int64  Float64?
─────┼─────────────────
   1 │     5       NaN
   2 │     6       NaN
Group 4 (2 rows): x = missing
 Row │ id     x
     │ Int64  Float64?
─────┼─────────────────
   1 │     7   missing
   2 │     8   missing

As you can see we get the same result again. As a side note let me
comment that unique internally uses the same mechanism as groupby
to identify duplicates.
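
A compact way to confirm the group sizes without printing every group is to count rows per group. The sketch below rebuilds df so it is self-contained; the name counts is illustrative.

```julia
using DataFrames

df = DataFrame(id=1:8, x=[0.0, 0.0, -0.0, -0.0, NaN, NaN, missing, missing])

gdf = groupby(df, :x)

# One row per group, with the group size stored in the :nrow column.
counts = combine(gdf, nrow)
```

Here counts should have four rows, each with a group size of 2, in line with the output above.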

Finally let us check joins:

julia> df_ref = DataFrame(x=[0.0, missing], val=1:2)
2×2 DataFrame
 Row │ x          val
     │ Float64?   Int64
─────┼──────────────────
   1 │       0.0      1
   2 │ missing        2

julia> outerjoin(df, df_ref, on=:x)
ERROR: ArgumentError: missing values in key columns are not allowed when matchmissing == :error

We hit the first problem. Joins detect that a missing value is present in the key column,
and by default they throw an error in such a case. We can change this using the matchmissing keyword argument.
Let us assume that we want missing values to be treated as equal and try the following join:

julia> outerjoin(df, df_ref, on=:x, matchmissing=:equal)
ERROR: ArgumentError: currently for numeric values NaN and `-0.0` in their real or imaginary components are not allowed. Use CategoricalArrays.jl to wrap these values in a CategoricalVector to perform the requested join.

We still get an error. In joins, for safety, an error is thrown if -0.0 (or, as the message states, NaN)
is encountered in a key column. This can be fixed by transforming the :x column to categorical,
in which case -0.0 and 0.0 are considered to be different:

julia> using CategoricalArrays

julia> outerjoin(transform(df, :x => categorical => :x), df_ref, on=:x, matchmissing=:equal)
8×3 DataFrame
 Row │ id      x          val
     │ Int64?  Float64?   Int64?
─────┼────────────────────────────
   1 │      1        0.0        1
   2 │      2        0.0        1
   3 │      7  missing          2
   4 │      8  missing          2
   5 │      3       -0.0  missing
   6 │      4       -0.0  missing
   7 │      5      NaN    missing
   8 │      6      NaN    missing
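
To isolate the effect of matchmissing from the -0.0 and NaN restriction, here is a sketch on two small made-up tables that contain neither value. It uses innerjoin and compares the :equal option with :notequal, which drops rows with missing keys instead of matching them; the tables left and right are hypothetical.

```julia
using DataFrames

left = DataFrame(x=[1.0, missing], id=1:2)
right = DataFrame(x=[1.0, missing], val=[10, 20])

# missing keys are matched with each other, as isequal would do.
matched = innerjoin(left, right, on=:x, matchmissing=:equal)

# Rows with missing keys are dropped before matching.
dropped = innerjoin(left, right, on=:x, matchmissing=:notequal)
```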

Let us check the categorical vector in more detail:

julia> levels(categorical(df.x))
3-element Vector{Float64}:
  -0.0
   0.0
 NaN

As you can see 0.0 and -0.0 are considered to be separate levels in a categorical vector.
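
The order of the levels shown above (-0.0, 0.0, NaN) matches the order given by isless, the ordering counterpart of isequal. I assume here that the levels are sorted with isless by default, which is consistent with the output but is not stated explicitly in this post. The sketch below uses only Base functions:

```julia
sorted = sort([NaN, 0.0, -0.0])        # sort uses isless by default

lt_operator  = -0.0 < 0.0              # false: < follows IEEE 754
lt_isless    = isless(-0.0, 0.0)       # true: isless orders -0.0 before 0.0
nan_last     = isless(0.0, NaN)        # true: NaN sorts after other floats
missing_last = isless(NaN, missing)    # true: missing sorts after everything
```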

Conclusions

I hope the examples given today were useful for understanding how == and isequal work in Julia.

As a final comment, let me add that throwing an error for joins on -0.0 was a decision made
for safety reasons. However, if users give us feedback that other options for handling -0.0
would be useful (e.g. treating it as equal or as not equal to 0.0), then we could consider adding this feature
in future releases of DataFrames.jl.