Sorting data with missing values

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/04/12/sorting.html

Introduction

Sorting is one of the most common operations one wants to do with collections.
In this post I discuss how one can sort data that contain missing values.

The post was written under Julia 1.10.1 and Missings.jl 1.2.0.

General rules of comparison with missing values

By default missing is considered as greater than any other different value it is compared with:

julia> isless(Inf, missing)
true

julia> isless("abc", missing)
true

julia> isless(r"abc", missing)
true

Note, in particular, the last case. Although Regex does not support comparisons it can be compared to missing.
The reason is that isless has a general catch-all definition when one of the arguments is missing. Let us see it:

isless(::Missing, ::Missing) = false
isless(::Missing, ::Any) = false
isless(::Any, ::Missing) = true

The rule that missing is greater than all else has an important consequence when sorting.

Default sorting with missing values

Let us create a simple vector containing missing values:

julia> x = [missing, 3, 1, missing, 2, 4, missing]
7-element Vector{Union{Missing, Int64}}:
  missing
 3
 1
  missing
 2
 4
  missing

If we sort it missing values end up at the end of the produced vector
because, by default, sorting is done in ascending order:

julia> sort(x)
7-element Vector{Union{Missing, Int64}}:
 1
 2
 3
 4
  missing
  missing
  missing

If we want to get values in descending order missing values come first:

julia> sort(x, rev=true)
7-element Vector{Union{Missing, Int64}}:
  missing
  missing
  missing
 4
 3
 2
 1

But what if we wanted to have values sorted in descending order, but put missing at the end?

Supplementary sorting order

Users often wanted a functionality that would allow them to sort values, but treat missing
as the smallest. This means that if you sort your data in a descending order missing would be put at the end.
Similarly, if you want to sort your data in ascending order missing would be put at the beginning.

With Missings.jl release 1.2 this functionality is supported with the missingsmallest function:

julia> sort(x, lt=missingsmallest)
7-element Vector{Union{Missing, Int64}}:
  missing
  missing
  missing
 1
 2
 3
 4

julia> sort(x, lt=missingsmallest, rev=true)
7-element Vector{Union{Missing, Int64}}:
 4
 3
 2
 1
  missing
  missing
  missing

By default missingsmallest uses the isless comparison.

More advanced cases of treating missing as smallest

Assume that you have the following vector that you want to sort by
the length of the string:

julia> s = [missing, "abc", "x", missing, "bcde", "pq", missing]
7-element Vector{Union{Missing, String}}:
 missing
 "abc"
 "x"
 missing
 "bcde"
 "pq"
 missing

If you try a simple way to do it you get an error:

julia> sort(s, by=length)
ERROR: MethodError: no method matching length(::Missing)

We need to wrap length in passmissing to get what we want:

julia> sort(s, by=passmissing(length))
7-element Vector{Union{Missing, String}}:
 "x"
 "pq"
 "abc"
 "bcde"
 missing
 missing
 missing

julia> sort(s, by=passmissing(length), rev=true)
7-element Vector{Union{Missing, String}}:
 missing
 missing
 missing
 "bcde"
 "abc"
 "pq"
 "x"

But what if we wanted to treat missing values as smallest?

The first approach is the one we already know:

julia> sort(s, by=passmissing(length), lt=missingsmallest)
7-element Vector{Union{Missing, String}}:
 missing
 missing
 missing
 "x"
 "pq"
 "abc"
 "bcde"

julia> sort(s, by=passmissing(length), lt=missingsmallest, rev=true)
7-element Vector{Union{Missing, String}}:
 "bcde"
 "abc"
 "pq"
 "x"
 missing
 missing
 missing

However, there is an alternative. You can define a comparison function that works on strings:

julia> isshorter(s1::AbstractString, s2::AbstractString) = length(s1) < length(s2)
isshorter (generic function with 1 method)

Then you can pass the isshorter function to missingsmallest
as a single argument to generate a comparison function
that automatically treats missing values as smallest:

julia> sort(s, lt=missingsmallest(isshorter))
7-element Vector{Union{Missing, String}}:
 missing
 missing
 missing
 "x"
 "pq"
 "abc"
 "bcde"

julia> sort(s, lt=missingsmallest(isshorter), rev=true)
7-element Vector{Union{Missing, String}}:
 "bcde"
 "abc"
 "pq"
 "x"
 missing
 missing
 missing

Conclusions

The missingsmallest functionality was added in Missings.jl 1.2.
I hope you will find it useful when working with your data!

Deduplication of rows in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/04/05/duplicates.html

Introduction

Deduplication of rows in a table is one of the basic functionalities that
is often needed when working with data frames. Today I discuss the
allunique, nonunique, unique, and unique! functions that
are provided by DataFrames.jl and can help you with this task.

The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.

Checking if a data frame has duplicate rows

Let us start with discussing how one can check if a data frame has duplicate rows
as this is the simplest check and the functionalities that we discuss here
carry-over to other functions that we discuss later.

First create a simple data frame:

julia> using DataFrames

julia> df = DataFrame(x=1:6, y=[1.0, 2.0, 1.0, 2.0, 0.0, -0.0])
6×2 DataFrame
 Row │ x      y
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.0
   2 │     2      2.0
   3 │     3      1.0
   4 │     4      2.0
   5 │     5      0.0
   6 │     6     -0.0

By just calling the allunique function we can check if whole rows of this data frame are unique:

julia> allunique(df)
true

In this case we get true as indeed all rows are unique. It is guaranteed by the column "x" which holds
consecutive integers.

However, we can pass a second positional argument to allunique. In this case we can narrow down the list of
checked columns:

julia> allunique(df, "y")
false

Here we checked uniqueness of only column "y", which contains duplicates, e.g. row 1 and row 3 contain the same value 1.0,
so we got false.

But this is not all. The second positional argument can be any transformation that is supported by the select function.
Therefore, for example, we can run:

julia> allunique(df, "x" => ByRow(iseven))
false

We got false, as applying the iseven to the x column creates duplicates since we have multiple even and odd values in it.
But e.g. we have:

julia> allunique(df, "x" => ByRow(x -> x^2))
true

Now we get true as squares of consecutive integers are unique.

We can pass several transformations as well:

julia> allunique(df, ["x" => ByRow(x -> mod(x, 3)), "y" => identity])
true

To convince ourselves that the true result is correct let us run the select operation with the same argument:

julia> select(df, ["x" => ByRow(x -> mod(x, 3)), "y" => identity])
6×2 DataFrame
 Row │ x_function  y_identity
     │ Int64       Float64
─────┼────────────────────────
   1 │          1         1.0
   2 │          2         2.0
   3 │          0         1.0
   4 │          1         2.0
   5 │          2         0.0
   6 │          0        -0.0

Indeed the rows produced by this operation are unique.

Finding duplicate rows

To get a vector with indicators of duplicate rows in a data frame use the nonunique function. Here are three examples of its usage
(note it also can take a second positional argument just like allunique):

julia> nonunique(df)
6-element Vector{Bool}:
 0
 0
 0
 0
 0
 0

All rows are unique in df, as we already know, so we got a vector of falses in the call above.

Now the second example:

julia> nonunique(df, "x" => ByRow(iseven))
6-element Vector{Bool}:
 0
 0
 1
 1
 1
 1

Here we see that we get true for all rows for which there was already a duplicate row before. So first two rows get false (non-duplicated)
and the following rows have the true indicator (as we have already seen an even and an odd number in column "x").

Now look at the last example:

julia> nonunique(df, "y")
6-element Vector{Bool}:
 0
 0
 1
 1
 0
 0

You might be surprised by the last false. The reason is that all the de-duplication functions use isequal to compare values for equality,
and 0.0 is not equal to -0.0 in this comparison:

julia> isequal(0.0, -0.0)
false

This behavior matches the way how dictionaries work in Julia.

Additionally the nonunique has a keep keyword argument. It allows us to change the default behavior which rows are marked as duplicate.
If we pass keep=:last then the last of the duplicated rows is marked as unique. See for example:

julia> nonunique(df, "x" => ByRow(iseven); keep=:last)
6-element Vector{Bool}:
 1
 1
 1
 1
 0
 0

We get false in last two rows as 5 and 6 are last even and odd numbers respectively.

The third option is keep=:noduplicates in which case only rows that have no duplicates are marked as unique. So we have:

julia> nonunique(df, "x" => ByRow(iseven); keep=:noduplicates)
6-element Vector{Bool}:
 1
 1
 1
 1
 1
 1

as no row was truly unique, but we have:

julia> nonunique(df, "y"; keep=:noduplicates)
6-element Vector{Bool}:
 1
 1
 1
 1
 0
 0

as first four rows were duplicated, but rows with 0.0 and -0.0 are indeed unique.

Removing duplicate rows from a data frame

The nonunique function returns a vector of duplicate indicators. Often we just want to get rid of them from our data frame.
The unique and unique! functions can be used to perform this operation. They support the same arguments as nonunique.
You have three options how you cen get your result:

  • using unique you get a new data frame by default;
  • using unique with view=true keyword argument passed you get a view of the source data frame with duplicates removed;
  • using unique! you drop the duplicates in-place from the source data frame.

Let us see how it works. First plain unique:

julia> unique(df, "y")
4×2 DataFrame
 Row │ x      y
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.0
   2 │     2      2.0
   3 │     5      0.0
   4 │     6     -0.0

We got a new data frame. The df data frame is unchanged. The second option is a view:

julia> unique(df, "y"; view=true)
4×2 SubDataFrame
 Row │ x      y
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.0
   2 │     2      2.0
   3 │     5      0.0
   4 │     6     -0.0

Note that still df is untouched:

julia> df
6×2 DataFrame
 Row │ x      y
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.0
   2 │     2      2.0
   3 │     3      1.0
   4 │     4      2.0
   5 │     5      0.0
   6 │     6     -0.0

And finally we can change the df data frame in place:

julia> unique!(df, "y")
4×2 DataFrame
 Row │ x      y
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.0
   2 │     2      2.0
   3 │     5      0.0
   4 │     6     -0.0

julia> df
4×2 DataFrame
 Row │ x      y
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.0
   2 │     2      2.0
   3 │     5      0.0
   4 │     6     -0.0

In this case, as you can see, the df data frame was updated.

Conclusions

I hope that you will find this review of the functionalities of the
allunique, nonunique, unique, and unique! functions useful.

As a summary remember that:

  • You can determine uniqueness of rows based on transformations of data contained in the source data frame.
  • You can decide which rows are marked as duplicate using the keep keyword argument.

Getting full factorial design in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/03/29/ffd.html

Introduction

Often when working with data we need to get all possible combinations of some
input factors in a data frame. In the field of design of experiments
this is called full factorial design. In this post I will discuss two functions
that DataFrames.jl provides that can help you to generate such designs if
you needed them.

The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.

What is a full factorial design and how to create it?

Assume we are a cardboard box producer have three factors describing a box: its width, height, and depth.
Each of them has a finite set of possible values (due to production process limitations).
Let us create some sample data of this kind:

julia> height = [10, 12]
2-element Vector{Int64}:
 10
 12

julia> width = [8, 10, 15]
3-element Vector{Int64}:
  8
 10
 15

julia> depth = [5, 6]
2-element Vector{Int64}:
 5
 6

Our task is to compute the volume of all possible boxes that can be created by our factory.
The list of all possible cardboard box configurations is a full factorial design.
You can get an iterator of these values by using the Iterators.product function:

julia> Iterators.product(height, width, depth)
Base.Iterators.ProductIterator{Tuple{Vector{Int64}, Vector{Int64}, Vector{Int64}}}(([10, 12], [8, 10, 15], [5, 6]))

This function is lazy, to see the result we need to materialize its return value using e.g. collect:

julia> collect(Iterators.product(height, width, depth))
2×3×2 Array{Tuple{Int64, Int64, Int64}, 3}:
[:, :, 1] =
 (10, 8, 5)  (10, 10, 5)  (10, 15, 5)
 (12, 8, 5)  (12, 10, 5)  (12, 15, 5)

[:, :, 2] =
 (10, 8, 6)  (10, 10, 6)  (10, 15, 6)
 (12, 8, 6)  (12, 10, 6)  (12, 15, 6)

We can see that we get an array of tuples of all possible combinations of dimensions. Let us now compute the volumes:

julia> prod.(collect(Iterators.product(height, width, depth)))
2×3×2 Array{Int64, 3}:
[:, :, 1] =
 400  500  750
 480  600  900

[:, :, 2] =
 480  600   900
 576  720  1080

The results are nice and efficient. However, sometimes it is more convenient to have this data in a data frame.

Full factorial design in DataFrames.jl

Let us repeat the exercise using DataFrames.jl:

julia> using DataFrames

julia> df = allcombinations(DataFrame; height, width, depth)
12×3 DataFrame
 Row │ height  width  depth
     │ Int64   Int64  Int64
─────┼──────────────────────
   1 │     10      8      5
   2 │     12      8      5
   3 │     10     10      5
   4 │     12     10      5
   5 │     10     15      5
   6 │     12     15      5
   7 │     10      8      6
   8 │     12      8      6
   9 │     10     10      6
  10 │     12     10      6
  11 │     10     15      6
  12 │     12     15      6

Note that we passed height, width, depth as keyword arguments to allcombinations taking advantage of a nice functionality of Julia that in this case we can avoid writing height=height as just writing height gives us the same result.

Now we can add a volume column:

julia> transform!(df, All() => ByRow(*) => "volume")
12×4 DataFrame
 Row │ height  width  depth  volume
     │ Int64   Int64  Int64  Int64
─────┼──────────────────────────────
   1 │     10      8      5     400
   2 │     12      8      5     480
   3 │     10     10      5     500
   4 │     12     10      5     600
   5 │     10     15      5     750
   6 │     12     15      5     900
   7 │     10      8      6     480
   8 │     12      8      6     576
   9 │     10     10      6     600
  10 │     12     10      6     720
  11 │     10     15      6     900
  12 │     12     15      6    1080

We have added the "volume" column in place to df. Note that we used * as it can take any number of positional arguments and returns their product. The ByRow wrapper signals that we want to perform this operation row-wise.

In comparison to the solution shown before many users find presentation of a full factorial design easier to work with.

What if we have a fractional factorial design?

Sometimes your data is incomplete, and some level combinations are missing. Let us start by creating such a data frame:

julia> df2 = df[Not(2, 5, 9), :]
9×4 DataFrame
 Row │ height  width  depth  volume
     │ Int64   Int64  Int64  Int64
─────┼──────────────────────────────
   1 │     10      8      5     400
   2 │     10     10      5     500
   3 │     12     10      5     600
   4 │     12     15      5     900
   5 │     10      8      6     480
   6 │     12      8      6     576
   7 │     12     10      6     720
   8 │     10     15      6     900
   9 │     12     15      6    1080

Now one might ask to complete this design and re-fill the design to be complete. This can be done by the fillcombinations function. Let us see it at work:

julia> fillcombinations(df2, Not("volume"))
12×4 DataFrame
 Row │ height  width  depth  volume
     │ Int64   Int64  Int64  Int64?
─────┼───────────────────────────────
   1 │     10      8      5      400
   2 │     12      8      5  missing
   3 │     10     10      5      500
   4 │     12     10      5      600
   5 │     10     15      5  missing
   6 │     12     15      5      900
   7 │     10      8      6      480
   8 │     12      8      6      576
   9 │     10     10      6  missing
  10 │     12     10      6      720
  11 │     10     15      6      900
  12 │     12     15      6     1080

Observe that after calling this function we have created a new data frame with the missing rows added. The "volume" column is filled by default with missing for rows that were added. The Not("volume") argument meant that we want to get all combinations of values in all columns except "volume".

Conclusions

Today we worked with two functions: allcombinations and fillcombinations. You will find them useful if in your work you will ever need to create all combinations of levels of some factors. This functionality seems niche, but it is needed in practice surprisingly often.