Sorting data with missing values

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/04/12/sorting.html

Introduction

Sorting is one of the most common operations one wants to do with collections.
In this post I discuss how one can sort data that contain missing values.

The post was written under Julia 1.10.1 and Missings.jl 1.2.0.

General rules of comparison with missing values

By default, missing is considered greater than any other value it is compared with:

julia> isless(Inf, missing)
true

julia> isless("abc", missing)
true

julia> isless(r"abc", missing)
true

Note the last case in particular. Although Regex does not support comparisons, it can still be compared to missing.
The reason is that isless has general catch-all definitions for the case when one of the arguments is missing. Here they are:

isless(::Missing, ::Missing) = false
isless(::Missing, ::Any) = false
isless(::Any, ::Missing) = true

The rule that missing is greater than all else has an important consequence when sorting.
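
To see the rule in action, here is a small Base-only sketch. It contrasts isless, which always returns a Bool, with the < operator, which propagates missing instead. The values used are only for illustration.

```julia
# isless gives a definite answer even when missing is involved
@show isless(1, missing)        # true: any value is less than missing
@show isless(missing, 1)        # false
@show isless(missing, missing)  # false

# the < operator propagates missing, so it cannot serve as a sorting rule
@show 1 < missing               # missing
```

Sorting relies on isless by default, which is why missing values get a well-defined position in the result.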

Default sorting with missing values

Let us create a simple vector containing missing values:

julia> x = [missing, 3, 1, missing, 2, 4, missing]
7-element Vector{Union{Missing, Int64}}:
  missing
 3
 1
  missing
 2
 4
  missing

If we sort it, missing values end up at the end of the produced vector
because, by default, sorting is done in ascending order:

julia> sort(x)
7-element Vector{Union{Missing, Int64}}:
 1
 2
 3
 4
  missing
  missing
  missing

If we want to get values in descending order, missing values come first:

julia> sort(x, rev=true)
7-element Vector{Union{Missing, Int64}}:
  missing
  missing
  missing
 4
 3
 2
 1

But what if we wanted the values sorted in descending order, with missing placed at the end?

Supplementary sorting order

Users often asked for functionality that would let them sort values while treating missing
as the smallest value. With such an ordering, sorting in descending order puts missing at the end.
Similarly, sorting in ascending order puts missing at the beginning.

With Missings.jl release 1.2 this functionality is supported with the missingsmallest function:

julia> using Missings

julia> sort(x, lt=missingsmallest)
7-element Vector{Union{Missing, Int64}}:
  missing
  missing
  missing
 1
 2
 3
 4

julia> sort(x, lt=missingsmallest, rev=true)
7-element Vector{Union{Missing, Int64}}:
 4
 3
 2
 1
  missing
  missing
  missing

By default missingsmallest uses the isless comparison.
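
To make the ordering concrete, here is a Base-only sketch that mimics the behavior with ordinary method dispatch. The function name missingfirst is hypothetical, and this is not the Missings.jl implementation; it only mirrors the structure of the isless definitions shown earlier, with the roles of the missing argument swapped.

```julia
# Hypothetical comparison function for illustration only;
# it is not the Missings.jl implementation.
missingfirst(::Missing, ::Missing) = false  # two missings: neither is smaller
missingfirst(::Missing, ::Any) = true       # missing is smaller than any value
missingfirst(::Any, ::Missing) = false      # no value is smaller than missing
missingfirst(a, b) = isless(a, b)           # otherwise fall back to isless

x = [missing, 3, 1, missing, 2, 4, missing]

ascending = sort(x, lt=missingfirst)             # missing values first
descending = sort(x, lt=missingfirst, rev=true)  # missing values last

@show ascending
@show descending
```

In practice it is simpler to use missingsmallest directly; the sketch is only meant to show that no special sorting machinery is involved, just a different lt function.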

More advanced cases of treating missing as smallest

Assume that you have the following vector that you want to sort by
the length of the string:

julia> s = [missing, "abc", "x", missing, "bcde", "pq", missing]
7-element Vector{Union{Missing, String}}:
 missing
 "abc"
 "x"
 missing
 "bcde"
 "pq"
 missing

If you try the simplest way to do it, you get an error:

julia> sort(s, by=length)
ERROR: MethodError: no method matching length(::Missing)

We need to wrap length in passmissing to get what we want:

julia> sort(s, by=passmissing(length))
7-element Vector{Union{Missing, String}}:
 "x"
 "pq"
 "abc"
 "bcde"
 missing
 missing
 missing
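
What passmissing(length) does for a single argument can be approximated in plain Julia. The helper name len_or_missing below is hypothetical, and the sketch ignores the multi-argument case that passmissing also handles. It shows the sorting keys that sort actually compares.

```julia
# Hypothetical single-argument stand-in for passmissing(length)
len_or_missing(v) = ismissing(v) ? missing : length(v)

s = [missing, "abc", "x", missing, "bcde", "pq", missing]

# the keys that sort compares when by=len_or_missing is passed
sort_keys = len_or_missing.(s)
@show sort_keys

# missing keys are greatest under isless, so those elements go last
sorted = sort(s, by=len_or_missing)
@show sorted
```

Because the keys for missing entries are themselves missing, the default rule that missing is greatest decides where they land.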

julia> sort(s, by=passmissing(length), rev=true)
7-element Vector{Union{Missing, String}}:
 missing
 missing
 missing
 "bcde"
 "abc"
 "pq"
 "x"

But what if we wanted to treat missing values as smallest?

The first approach is the one we already know:

julia> sort(s, by=passmissing(length), lt=missingsmallest)
7-element Vector{Union{Missing, String}}:
 missing
 missing
 missing
 "x"
 "pq"
 "abc"
 "bcde"

julia> sort(s, by=passmissing(length), lt=missingsmallest, rev=true)
7-element Vector{Union{Missing, String}}:
 "bcde"
 "abc"
 "pq"
 "x"
 missing
 missing
 missing

However, there is an alternative. You can define a comparison function that works on strings:

julia> isshorter(s1::AbstractString, s2::AbstractString) = length(s1) < length(s2)
isshorter (generic function with 1 method)

Then you can pass the isshorter function to missingsmallest
as a single argument to generate a comparison function
that automatically treats missing values as smallest:

julia> sort(s, lt=missingsmallest(isshorter))
7-element Vector{Union{Missing, String}}:
 missing
 missing
 missing
 "x"
 "pq"
 "abc"
 "bcde"

julia> sort(s, lt=missingsmallest(isshorter), rev=true)
7-element Vector{Union{Missing, String}}:
 "bcde"
 "abc"
 "pq"
 "x"
 missing
 missing
 missing
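
The object returned by missingsmallest(isshorter) is an ordinary callable, so it can be stored in a variable and tried out on individual pairs. The short sketch below assumes Missings.jl 1.2 or later is installed; the variable name lt is arbitrary.

```julia
using Missings  # assumes Missings.jl 1.2 or later is installed

isshorter(s1::AbstractString, s2::AbstractString) = length(s1) < length(s2)

# missingsmallest(isshorter) returns a callable that can be stored and reused
lt = missingsmallest(isshorter)

@show lt(missing, "x")      # missing is treated as smaller than any string
@show lt("x", missing)      # so no string is smaller than missing
@show lt(missing, missing)  # two missings: neither is smaller
@show lt("x", "pq")         # both present: the decision is left to isshorter
```

Note that isshorter itself has no method for missing. It is never called with a missing argument because the wrapper handles those cases first.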

Conclusions

The missingsmallest functionality was added in Missings.jl 1.2.
I hope you will find it useful when working with your data!

From FASTQ to CNV calls in Julia

By: Blogs on jonathanBieler.github.io

Re-posted from: https://jonathanbieler.github.io/blog/fastq2cnv/

I’ve recently worked on Julia bindings for BWA, which make it possible to align sequencing reads (FASTA or FASTQ) to a reference genome directly from Julia. On top of this, updates to XAM.jl and BioRecordsProcessing.jl will make it possible to build end-to-end workflows in Julia, from raw reads to analysis results.
To illustrate this I will showcase online whole-genome sequencing (WGS) copy number variation (CNV) calling on a mouse kidney tumor sample.
Briefly, we need to do the following:

Deduplication of rows in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/04/05/duplicates.html

Introduction

Deduplication of rows in a table is one of the basic functionalities that
is often needed when working with data frames. Today I discuss the
allunique, nonunique, unique, and unique! functions that
are provided by DataFrames.jl and can help you with this task.

The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.

Checking if a data frame has duplicate rows

Let us start by discussing how to check if a data frame has duplicate rows,
as this is the simplest check, and the functionalities discussed here
carry over to the functions that we discuss later.

First create a simple data frame:

julia> using DataFrames

julia> df = DataFrame(x=1:6, y=[1.0, 2.0, 1.0, 2.0, 0.0, -0.0])
6×2 DataFrame
 Row │ x      y
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.0
   2 │     2      2.0
   3 │     3      1.0
   4 │     4      2.0
   5 │     5      0.0
   6 │     6     -0.0

By just calling the allunique function we can check if whole rows of this data frame are unique:

julia> allunique(df)
true

In this case we get true as indeed all rows are unique. It is guaranteed by the column "x" which holds
consecutive integers.

However, we can pass a second positional argument to allunique. In this case we can narrow down the list of
checked columns:

julia> allunique(df, "y")
false

Here we checked uniqueness of only column "y", which contains duplicates, e.g. row 1 and row 3 contain the same value 1.0,
so we got false.

But this is not all. The second positional argument can be any transformation that is supported by the select function.
Therefore, for example, we can run:

julia> allunique(df, "x" => ByRow(iseven))
false

We got false, as applying iseven to the x column creates duplicates: the column contains multiple even and multiple odd values.
But, for example, we have:

julia> allunique(df, "x" => ByRow(x -> x^2))
true

Now we get true as squares of consecutive integers are unique.

We can pass several transformations as well:

julia> allunique(df, ["x" => ByRow(x -> mod(x, 3)), "y" => identity])
true

To convince ourselves that the true result is correct let us run the select operation with the same argument:

julia> select(df, ["x" => ByRow(x -> mod(x, 3)), "y" => identity])
6×2 DataFrame
 Row │ x_function  y_identity
     │ Int64       Float64
─────┼────────────────────────
   1 │          1         1.0
   2 │          2         2.0
   3 │          0         1.0
   4 │          1         2.0
   5 │          2         0.0
   6 │          0        -0.0

Indeed the rows produced by this operation are unique.
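
The same checks can be cross-validated on plain vectors using only Base. This is a sketch for building intuition, not a description of how DataFrames.jl performs the check internally; the variable names are arbitrary.

```julia
x = 1:6
y = [1.0, 2.0, 1.0, 2.0, 0.0, -0.0]

# one column of Bool values: only two distinct values, so duplicates exist
@show allunique(iseven.(x))

# squares of consecutive positive integers are all different
@show allunique(x .^ 2)

# two transformed columns: a row is the tuple of both values
rows = collect(zip(mod.(x, 3), y))
@show allunique(rows)
```

The last call treats each tuple as one row, which corresponds to checking uniqueness over several columns at once.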

Finding duplicate rows

To get a vector with indicators of duplicate rows in a data frame use the nonunique function. Here are three examples of its usage
(note it also can take a second positional argument just like allunique):

julia> nonunique(df)
6-element Vector{Bool}:
 0
 0
 0
 0
 0
 0

All rows are unique in df, as we already know, so we got a vector of falses in the call above.

Now the second example:

julia> nonunique(df, "x" => ByRow(iseven))
6-element Vector{Bool}:
 0
 0
 1
 1
 1
 1

Here we get true for every row that duplicates an earlier row. So the first two rows get false (non-duplicated)
and the following rows get the true indicator (as we have already seen an even and an odd number in column "x").

Now look at the last example:

julia> nonunique(df, "y")
6-element Vector{Bool}:
 0
 0
 1
 1
 0
 0

You might be surprised by the last false. The reason is that all the de-duplication functions use isequal to compare values for equality,
and 0.0 is not equal to -0.0 in this comparison:

julia> isequal(0.0, -0.0)
false

This behavior matches how dictionaries work in Julia.
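
The difference between == and isequal is easy to check with Base alone. The sketch below also shows the dictionary behavior mentioned above and the related NaN case; the dictionary values are arbitrary labels.

```julia
# == follows IEEE floating-point rules, isequal distinguishes the two zeros
@show 0.0 == -0.0           # true
@show isequal(0.0, -0.0)    # false

# for NaN the situation is reversed
@show NaN == NaN            # false
@show isequal(NaN, NaN)     # true

# dictionaries use isequal, so 0.0 and -0.0 are separate keys
d = Dict(0.0 => "positive zero", -0.0 => "negative zero")
@show length(d)             # 2

# unique on a plain vector follows the same rule
@show unique([0.0, -0.0, NaN, NaN])
```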

Additionally, nonunique has a keep keyword argument. It lets us change which rows are marked as duplicates.
By default (keep=:first) the first of the duplicated rows is marked as unique. If we pass keep=:last, the last of the duplicated rows is marked as unique instead. See for example:

julia> nonunique(df, "x" => ByRow(iseven); keep=:last)
6-element Vector{Bool}:
 1
 1
 1
 1
 0
 0

We get false in the last two rows, as 5 and 6 are the last odd and even numbers respectively.

The third option is keep=:noduplicates in which case only rows that have no duplicates are marked as unique. So we have:

julia> nonunique(df, "x" => ByRow(iseven); keep=:noduplicates)
6-element Vector{Bool}:
 1
 1
 1
 1
 1
 1

as no row was truly unique, but we have:

julia> nonunique(df, "y"; keep=:noduplicates)
6-element Vector{Bool}:
 1
 1
 1
 1
 0
 0

as the first four rows were duplicated, but the rows with 0.0 and -0.0 are indeed unique.
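
The three keep modes can be mimicked on a plain vector with a short Base-only sketch. The function name duplicate_flags is hypothetical, and this is not how DataFrames.jl implements nonunique; it only reproduces the indicators shown above for a single column.

```julia
# Hypothetical helper: flag duplicates in a vector, comparing with isequal
# (Set and Dict both rely on isequal and hash).
function duplicate_flags(v; keep::Symbol=:first)
    if keep === :noduplicates
        # count occurrences, then flag every value seen more than once
        counts = Dict{eltype(v), Int}()
        for val in v
            counts[val] = get(counts, val, 0) + 1
        end
        return [counts[val] > 1 for val in v]
    end
    keep in (:first, :last) || throw(ArgumentError("unsupported keep: $keep"))
    # scan forward to keep first occurrences, backward to keep last ones
    idxs = keep === :first ? eachindex(v) : reverse(eachindex(v))
    seen = Set{eltype(v)}()
    flags = fill(false, length(v))
    for i in idxs
        if v[i] in seen
            flags[i] = true
        else
            push!(seen, v[i])
        end
    end
    return flags
end

y = [1.0, 2.0, 1.0, 2.0, 0.0, -0.0]
@show duplicate_flags(y)
@show duplicate_flags(y; keep=:last)
@show duplicate_flags(y; keep=:noduplicates)
```

The only difference between :first and :last is the direction of the scan, while :noduplicates needs the full counts before any row can be judged.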

Removing duplicate rows from a data frame

The nonunique function returns a vector of duplicate indicators. Often we just want to drop the duplicated rows from our data frame.
The unique and unique! functions can be used to perform this operation. They support the same arguments as nonunique.
You have three options for getting your result:

  • using unique you get a new data frame by default;
  • using unique with view=true keyword argument passed you get a view of the source data frame with duplicates removed;
  • using unique! you drop the duplicates in-place from the source data frame.

Let us see how it works. First plain unique:

julia> unique(df, "y")
4×2 DataFrame
 Row │ x      y
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.0
   2 │     2      2.0
   3 │     5      0.0
   4 │     6     -0.0

We got a new data frame. The df data frame is unchanged. The second option is a view:

julia> unique(df, "y"; view=true)
4×2 SubDataFrame
 Row │ x      y
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.0
   2 │     2      2.0
   3 │     5      0.0
   4 │     6     -0.0

Note that df is still untouched:

julia> df
6×2 DataFrame
 Row │ x      y
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.0
   2 │     2      2.0
   3 │     3      1.0
   4 │     4      2.0
   5 │     5      0.0
   6 │     6     -0.0

And finally we can change the df data frame in place:

julia> unique!(df, "y")
4×2 DataFrame
 Row │ x      y
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.0
   2 │     2      2.0
   3 │     5      0.0
   4 │     6     -0.0

julia> df
4×2 DataFrame
 Row │ x      y
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.0
   2 │     2      2.0
   3 │     5      0.0
   4 │     6     -0.0

In this case, as you can see, the df data frame was updated.
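
The relation between nonunique and unique can be checked directly. The sketch below recreates the original six-row data frame (since df was modified in place above) and assumes DataFrames.jl 1.6 or later, as used in this post; the variable names mask, manual, and v are arbitrary.

```julia
using DataFrames  # assumes DataFrames.jl 1.6 or later, as in the post

df = DataFrame(x=1:6, y=[1.0, 2.0, 1.0, 2.0, 0.0, -0.0])

# rows that nonunique does NOT flag are the ones that unique keeps
mask = .!nonunique(df, "y")
manual = df[mask, :]
@show manual == unique(df, "y")

# the keep keyword is accepted by unique as well
@show unique(df, "y"; keep=:last).x

# a view does not copy data: its parent is the source data frame
v = unique(df, "y"; view=true)
@show parent(v) === df
```

Filtering with the negated indicator vector is handy when you want to inspect or log the dropped rows before removing them.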

Conclusions

I hope that you will find this review of the functionalities of the
allunique, nonunique, unique, and unique! functions useful.

As a summary remember that:

  • You can determine uniqueness of rows based on transformations of data contained in the source data frame.
  • You can decide which rows are marked as duplicate using the keep keyword argument.