ABC of handling missing values in Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/09/03/missing.html

Introduction

When working with real data one often encounters missing values. This is an
introductory level post aiming to explain the corner cases of working with such
data in the Julia language. It is intended to complement the section on
Missing Values of the Julia Manual. I highly recommend to read it to
everyone interested in the subject and therefore I will skip many topics that
are covered in detail there.

The post was written under Julia 1.6.1 and Missings.jl 1.0.1.

Introducing missing

Missing values are represented in Julia using missing that has type Missing.

As is explained in the section on Missing Values of the Julia Manual:

Julia provides support for representing missing values in the statistical
sense, that is for situations where no value is available for a variable in
an observation, but a valid value theoretically exists.

It is useful to contrast this contract with the intended use of nothing
value that has type Nothing, which should be used when some value objectively
does not exist.

For example findfirst(==(1), 2:3) returns nothing value as there does not
exist an index in the 2:3 range for which the value is equal to 1. On the
other hand if we have some empirical data collected, e.g. about patients in the
clinical trial, and for one of such patients we have not recorded subjects age
then it should be represented as missing (the patient objectively has some
age but we just do not know it).

If we work with data it is convenient to check if some value is missing using
the ismissing function. For example here is a way to drop missing values
from a vector:

julia> x = [1, missing, 3, missing]
4-element Vector{Union{Missing, Int64}}:
 1
  missing
 3
  missing

julia> filter(!ismissing, x)
2-element Vector{Union{Missing, Int64}}:
 1
 3

In this example we have used the !ismissing expression which produces a
function opposite to ismissing, i.e. returning true if the value is not
missing.

Typical problems with missing values

Since missing value follows a three-valued logic the following fails:

julia> findall(==(1), [1, missing, 1, 2])
ERROR: TypeError: non-boolean (Missing) used in boolean context

The reason is that:

julia> 1 == missing
missing

and we can see that the comparison does not produce a valid Bool value.

There are two ways to work around this problem. The first one is to use the
isequal function:

julia> findall(isequal(1), [1, missing, 1, 2])
2-element Vector{Int64}:
 1
 3

The other is to use the coalesce function:

julia> findall(x -> coalesce(x == 1, false), [1, missing, 1, 2])
2-element Vector{Int64}:
 1
 3

It is important to remember that these are not equivalent approaches.
They can differ most notably when working with floating point numbers.
Here is an example:

julia> findall(isequal(NaN), [NaN, missing, -0.0, 0.0, 1.0])
1-element Vector{Int64}:
 1

julia> findall(x -> coalesce(x == NaN, false), [NaN, missing, -0.0, 0.0, 1.0])
Int64[]

julia> findall(isequal(0.0), [NaN, missing, -0.0, 0.0, 1.0])
1-element Vector{Int64}:
 4

julia> findall(x -> coalesce(x == 0.0, false), [NaN, missing, -0.0, 0.0, 1.0])
2-element Vector{Int64}:
 3
 4

Of course one should use the method that is appropriate in the application area.

Corner cases of skipmissings

Typically aggregation functions produce missing when they are passed a collection
holding missing values:

julia> sum([1, missing, 2])
missing

A work-around this issue is to use the skipmissing wrapper that is a lazy
iterator skipping missing values in the passed collection, so the following
works:

julia> sum(skipmissing([1, missing, 2]))
3

It is important to know a corner case of skipmissing when the collection after
skipping missing values is empty. In such a case skipmissing tries to strip
the Missing part from the eltype of the collection, and if it is specific
enough it can be used to produce a proper result of the aggregation. However,
if the type is not specific enough an error is raised, as you can see here:

julia> sum(skipmissing(Union{Int, Missing}[missing, missing, missing]))
0

julia> sum(skipmissing([missing, missing, missing]))
ERROR: ArgumentError: reducing over an empty collection is not allowed

julia> sum(skipmissing(Any[missing, missing, missing]))
ERROR: MethodError: no method matching zero(::Type{Any})

The conclusion is that one should try to use collections of Union{Missing, T},
where T is a concrete type.

In the example above we were affected by one important design decision behind
Missing type. It is a singleton type that is not parametric. The missing
value does not carry information what is the type of the missing value, it
could be any type. Here it is worth to contrast this with e.g. R, where we have
NA, NA_integer_, NA_real_, NA_character_, and NA_complex_ constants
that cover selected, most common R types, so you have the following (run under
R 4.1.1):

> sin(NA)
[1] NA
> sin(NA_character_)
Error in sin(NA_character_) :
  non-numeric argument to mathematical function
> sum(NA, na.rm=T)
[1] 0
> sum(NA_complex_, na.rm=T)
[1] 0+0i
> sum(NA_character_, na.rm=T)
Error in sum(NA_character_, na.rm = T) :
  invalid 'type' (character) of argument

The decision to avoid such differences was deliberate and was meant to simplify
the design of functions working with missing values (at the cost of not
carrying the type information, which has to be managed on user’s side).

You might ask how one can extract T from Union{Missing, T} type. It is
easy using the nonmissingtype function:

ulia> nonmissingtype(Float64)
Float64

julia> nonmissingtype(Union{Missing, Float64})
Float64

julia> nonmissingtype(Any)
Any

julia> nonmissingtype(Union{Missing, Any})
Any

Functions not supporting missing values

As we have seen above many functions produce missing when they are passed
missing as an argument. The rationale is as follows: passed argument is
objectively present but just unknown, so the result of the operation should also
be present, but it is just unknown. Here are some examples of this behavior:

julia> 1 < missing
missing

julia> sin(missing)
missing

However, not all functions follow this rule. Take e.g. an Int constructor:

julia> Int(missing)
ERROR: MethodError: no method matching Int64(::Missing)

So what should we do if we want to convert the vector of integer float or
missing values into a vector of Union{Int, Missing} element type?
Note that the following fails:

julia> Int.([1.0, 2.0, missing])
ERROR: MethodError: no method matching Int64(::Missing)

You can do one of the two things. Either handle the case of missing manually
like this:

julia> [ismissing(x) ? missing : Int(x) for x in [1.0, 2.0, missing]]
3-element Vector{Union{Missing, Int64}}:
 1
 2
  missing

or use the passmissing wrapper that is defined in the Missings.jl package:

julia> using Missings

julia> passmissing(Int).([1.0, 2.0, missing])
3-element Vector{Union{Missing, Int64}}:
 1
 2
  missing

Changing element type of the collections

Very often we have a collection of data whose element type is not like we would
want it to be and we need to perform an appropriate transformation that keeps
the data but just changes the element type. In the context of missing values
there are two such operations.

The first is when we have a collection that does not allow missing values, but
we want another collection that holds the same data, but allows them (e.g.
because later we might want to store missing in such a collection). In such
a case use allowmissing from Missings.jl:

julia> x = [1, 2, 3]
3-element Vector{Int64}:
 1
 2
 3

julia> x[1] = missing
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Int64
Closest candidates are:
  convert(::Type{T}, ::Ptr) where T<:Integer at pointer.jl:23
  convert(::Type{T}, ::T) where T<:Number at number.jl:6
  convert(::Type{T}, ::Number) where T<:Number at number.jl:7
  ...
Stacktrace:
 [1] setindex!(A::Vector{Int64}, x::Missing, i1::Int64)
   @ Base ./array.jl:839
 [2] top-level scope
   @ REPL[63]:1

julia> y = allowmissing(x)
3-element Vector{Union{Missing, Int64}}:
 1
 2
 3

julia> y[1] = missing
missing

julia> y
3-element Vector{Union{Missing, Int64}}:
  missing
 2
 3

An opposite scenario is when we started with a collection allowing missing
values, which were removed from it and now we want it to have a narrower element
type. Here disallowmissing comes to our aid:

julia> x = [1, missing, 2]
3-element Vector{Union{Missing, Int64}}:
 1
  missing
 2

julia> filter!(!ismissing, x)
2-element Vector{Union{Missing, Int64}}:
 1
 2

julia> disallowmissing(x)
2-element Vector{Int64}:
 1
 2

Finally sometimes we might want to create a collection initially filled with
missing values, but allowing additionally some specific type of values.
Unfortunately the fill function will not help us here:

julia> fill(missing, 2)
2-element Vector{Missing}:
 missing
 missing

and we can see that the element type of the produced collection is Missing and
it is too narrow for practical use.

We should use the missings function from Missings.jl instead. For instance:

julia> z = missings(Int, 3)
3-element Vector{Union{Missing, Int64}}:
 missing
 missing
 missing

julia> z[1] = 100
100

julia> z
3-element Vector{Union{Missing, Int64}}:
 100
    missing
    missing

The distinction between collections allowing and not allowing missing values
is quite important in practice, so it is worth remembering the allowmissing,
disallowmissing, and missings functions are available. To wrap up let us
contrast this design decision with R, where storing missing values in typical
scenarios (not in all scenarios though) is supported by default and cannot be
opted-out from.

Conclusions

Most of the topics I have discussed here are standard. However, hopefully for
people starting to work with missing values in the Julia language these examples
can serve as a good additional information on top of what is written in
the section on Missing Values of the Julia Manual.