DataFrames.jl joins: matchmissing=:notequal

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/06/05/notequal.html

Introduction

In DataFrames.jl we have recently added in this PR a new
option for matchmissing keyword argument in joins. This functionality will be
made available in 1.2 release. In this post I want to discuss this new feature
before we release it.

The post is tested under Julia 1.6.1 and on DataFrames.jl main branch
(that includes the relevant PR).

How matchmissing keyword argument works

The matchmissing keyword argument allows the user to decide how missing
value is handled in on columns in joins. After this PR you have
three options to choose from:

  • :error (the default): throw an error if missing value is present in any of
    the on columns; the rationale is that missing indicates unknown value so
    if we knew it it could match to any of the non-missing values in the on
    columns in the other data frame we join;
  • :equal: missing values are allowed and they are matched to missing values
    only; in this scenario we treat missing as any other value without giving
    it a special treatment;
  • :notequal (a new option): in this case missing is considered to be not
    equal to any other value (including missing).

Let me comment a bit more on the consequences of the :notequal rule. In
innerjoin this means that rows with missing values will be dropped both in
left and right table. In leftjoin, semijoin and antijoin they are dropped
from the right table only (which means that if missing is present in the left
table it is retained in processing but considered not to match any row in right
table). Similarly in rightjon rows with missing are dropped from left table
only. The case that is most difficult to handle is outerjoin. The reason is
that if missing would be present in both left and right table they would be
considered not equal and produce separate rows in the output table. We
considered this behavior as potentially confusing and therefore decided not to
allow :notequal in outerjoin.

Let me move to the examples showing the matchmissing=:notequal at work.

Examples

Here is a simple example code showing how the new option works:

julia> using DataFrames

julia> df1 = DataFrame(id=[1, missing, 3, 4], x=1:4)
4×2 DataFrame
 Row │ id       x
     │ Int64?   Int64
─────┼────────────────
   1 │       1      1
   2 │ missing      2
   3 │       3      3
   4 │       4      4

julia> df2 = DataFrame(id=[1, 2, missing, 4], y=1:4)
4×2 DataFrame
 Row │ id       y
     │ Int64?   Int64
─────┼────────────────
   1 │       1      1
   2 │       2      2
   3 │ missing      3
   4 │       4      4

Now we investigate all the possible join operations:

julia> innerjoin(df1, df2, on=:id, matchmissing=:notequal)
2×3 DataFrame
 Row │ id      x      y
     │ Int64?  Int64  Int64
─────┼──────────────────────
   1 │      1      1      1
   2 │      4      4      4

As you can see for innerjoin only rows with :id equal to 1 and 4 were
retained. Let us move forward:

julia> leftjoin(df1, df2, on=:id, matchmissing=:notequal, source=:source)
4×4 DataFrame
 Row │ id       x      y        source
     │ Int64?   Int64  Int64?   String
─────┼────────────────────────────────────
   1 │       1      1        1  both
   2 │       4      4        4  both
   3 │ missing      2  missing  left_only
   4 │       3      3  missing  left_only

julia> rightjoin(df1, df2, on=:id, matchmissing=:notequal, source=:source)
4×4 DataFrame
 Row │ id       x        y      source
     │ Int64?   Int64?   Int64  String
─────┼─────────────────────────────────────
   1 │       1        1      1  both
   2 │       4        4      4  both
   3 │       2  missing      2  right_only
   4 │ missing  missing      3  right_only

For leftjoin and rightjoin we retain missing but only in the table for
which all rows must be retained. Therefore in leftjoin for :id equal to
missing we have :x equal to 2, but :y equal to missing (signaling that
there was no match which we can also see in :source column). The same
happens for :id equal to missing in rightjoin, but then :x is set to
missing.

The same rules work with semijoin and antijoin as you can see here:

julia> semijoin(df1, df2, on=:id, matchmissing=:notequal)
2×2 DataFrame
 Row │ id      x
     │ Int64?  Int64
─────┼───────────────
   1 │      1      1
   2 │      4      4

julia> antijoin(df1, df2, on=:id, matchmissing=:notequal)
2×2 DataFrame
 Row │ id       x
     │ Int64?   Int64
─────┼────────────────
   1 │ missing      2
   2 │       3      3

Finally outerjoin just throws an error:

julia> outerjoin(df1, df2, on=:id, matchmissing=:notequal)
ERROR: ArgumentError: matchmissing == :notequal for `outerjoin` is not allowed

Conclusions

I hope this post helped you to learn the rationale and design of the new option
for the matchmissing keyword argument in joins. If you have any comments on
the functionality or its documentation please open an issue on DataFrames.jl GitHub repository.

Finally I would like to thank pstorozenko, nilshg, and
nalimilan for working on this functionality.