Re-posted from: https://bkamins.github.io/julialang/2022/06/17/missing.html
Introduction
Some time ago I have written a post about
ABC of handling missing values in Julia. Its objective was to give
an introduction to the topic for the newcomers. However, occasionally
users complain that working with missing values in Julia is less convenient
than in e.g. Python or R.
Such opinions are always debatable, so recently I decided to run a small
pool on Julia Discourse about the skipmissing
function.
The question was if we want to shorten the skipmissing
name into something
that is more convenient to use in interactive work.
To my surprise, a vast majority of voters preferred a verbose and explicit
operation name. This preference regarding handling of missings, to my surprise,
reminded me of several passages from The Zen of Python:
- Explicit is better than implicit.
- Readability counts.
- In the face of ambiguity, refuse the temptation to guess.
So given this preference, how should Julia users retain convenience. Let me
share some of my thoughts on this topic that did not make into my
previous post.
This post was written under Julia 1.7.2, Missings.jl 1.0.2,
MissingsAsFalse.jl 0.1, and StatsBase.jl 0.33.16.
Verbosity
Indeed writing missing
, skipmissing
, and passmissing
(the last one is
defined in Missings.jl), in places where they are needed, might seem verbose.
However, the good news is that most of the time you do not have to type them
fully thanks to completions so in your editor/REPL:
- instead of
missing
writemis<tab>
; - instead of
skipmissing
writeskipm<tab>
; - instead of
passmissing
writepas<tab>
.
Let us compare. In R:
sum(c(1, NA), na.rm=T)
vs Julia:
sum(skipmissing([1, missing]))
seems longer. However, if you take into account the amount of typing you need to
do (number of keystrokes) it is the same.
Missing values in logical conditions
As I have written in this post I personally strongly recommend using
coalesce
to handle missing
values in logical conditions. This allows you,
to follow the Explicit is better than implicit. principle by explicitly
showing in the code if missing
should be treated as true
or as false
.
Here is a short example:
julia> c = missing
missing
julia> c ? "true" : "false"
ERROR: TypeError: non-boolean (Missing) used in boolean context
julia> coalesce(c, false) ? "true" : "false"
"false"
However, some users find it more convenient to use @mfalse
macro from
the MissingsAsFalse.jl package:
julia> using MissingsAsFalse
julia> @mfalse c ? "true" : "false"
"false"
Correlation matrix with missing values
A common, and relatively complex case of handling missing values, is computing
of correlation matrix of data that contains missings. Let us check what Julia
offers here. We will use the pairwise
function from StatsBase.jl.
julia> using Random
julia> using StatsBase
julia> using Statistics
julia> Random.seed!(1234);
julia> x = rand([1:10; missing], 16, 4)
16×4 Matrix{Union{Missing, Int64}}:
4 missing 2 10
7 8 missing 8
3 missing 7 5
10 9 8 10
4 2 7 6
5 6 1 missing
missing 7 8 9
9 3 2 6
6 7 10 missing
9 3 4 6
7 2 9 8
9 7 3 3
1 8 6 5
3 9 4 7
5 7 3 2
8 5 missing 4
julia> pairwise(cor, eachcol(x))
4×4 Matrix{Union{Missing, Float64}}:
1.0 missing missing missing
missing 1.0 missing missing
missing missing 1.0 missing
missing missing missing 1.0
julia> pairwise(cor, eachcol(x), skipmissing=:pairwise)
4×4 Matrix{Float64}:
1.0 -0.218849 -0.0177704 0.0922413
-0.218849 1.0 0.00973122 0.102969
-0.0177704 0.00973122 1.0 0.364821
0.0922413 0.102969 0.364821 1.0
julia> pairwise(cor, eachcol(x), skipmissing=:listwise)
4×4 Matrix{Float64}:
1.0 -0.229568 -0.100028 0.247283
-0.229568 1.0 -0.127153 -0.0420058
-0.100028 -0.127153 1.0 0.67068
0.247283 -0.0420058 0.67068 1.0
By default pairwise
for cor
returns missing
when at least one of the
columns contains missing values. Use :pairwise
value of skipmissing
keyword
argument to skip entries with a missing
value in either of the two vectors
passed to cor
and use :listwise
to skip entries with a missing
value in
any of the vectors passed to pairwise
.
In this example we see that since there are several ways how missing values
should be handled by cor
in pairwise
computations, instead of using the
skipmissing
function, a keyword argument is used allowing to specify what
the data scientist wants exactly.
A similar situation is in the subset
and subset!
functions from
DataFrames.jl, that also take a skipmissing
keyword argument, to simplify
handling of logical conditions specifying which rows from a source data frame
should be kept.
Conclusions
In Julia the approach is that handling of missing values is explicit. This
choice is guided by the fact that then in the code it is explicitly visible what
was the developer’s decision about how they should be treated. This choice makes
code more verbose. Today I have tried to show that, especially with a good
editor/REPL support, in practice it does not introduce a large overhead.