Sorting data by a transformation of columns in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/03/12/sorting.html

Introduction

Recently I have several times received a question how to sort a data frame
based on some transformation of its columns. In this post I will show you
the way it can be currently done.

The post was written under Julia 1.6.0-rc1 and DataFrames 0.22.5.

Before we start create a data frame we will work with. It contains
coordinates of some points in a three-dimensional space.

julia> using DataFrames

julia> using LinearAlgebra

julia> using Random

julia> Random.seed!(1234)
MersenneTwister(1234)

julia> df = DataFrame(x=rand(10), y=rand(10), z=rand(10)) .- 0.5
10×3 DataFrame
 Row │ x           y          z
     │ Float64     Float64    Float64
─────┼──────────────────────────────────
   1 │  0.0908446   0.148882   0.450498
   2 │  0.266797   -0.489094   0.46467
   3 │  0.0662374  -0.433577   0.445775
   4 │ -0.0399147   0.456753   0.289904
   5 │  0.294026    0.146691   0.32116
   6 │  0.354147   -0.387514  -0.46584
   7 │ -0.299414   -0.223979  -0.405456
   8 │ -0.201386    0.151664  -0.185074
   9 │ -0.253163   -0.443358  -0.37219
  10 │  0.0796722   0.342714  -0.125813

Basic sorting

You can sort this data frame by column :y like this:

julia> sort(df, :y)
10×3 DataFrame
 Row │ x           y          z
     │ Float64     Float64    Float64
─────┼──────────────────────────────────
   1 │  0.266797   -0.489094   0.46467
   2 │ -0.253163   -0.443358  -0.37219
   3 │  0.0662374  -0.433577   0.445775
   4 │  0.354147   -0.387514  -0.46584
   5 │ -0.299414   -0.223979  -0.405456
   6 │  0.294026    0.146691   0.32116
   7 │  0.0908446   0.148882   0.450498
   8 │ -0.201386    0.151664  -0.185074
   9 │  0.0796722   0.342714  -0.125813
  10 │ -0.0399147   0.456753   0.289904

If you want to sort it in reverse just do:

julia> sort(df, :y, rev=true)
10×3 DataFrame
 Row │ x           y          z
     │ Float64     Float64    Float64
─────┼──────────────────────────────────
   1 │ -0.0399147   0.456753   0.289904
   2 │  0.0796722   0.342714  -0.125813
   3 │ -0.201386    0.151664  -0.185074
   4 │  0.0908446   0.148882   0.450498
   5 │  0.294026    0.146691   0.32116
   6 │ -0.299414   -0.223979  -0.405456
   7 │  0.354147   -0.387514  -0.46584
   8 │  0.0662374  -0.433577   0.445775
   9 │ -0.253163   -0.443358  -0.37219
  10 │  0.266797   -0.489094   0.46467

or

julia> sort(df, order(:y, rev=true))
10×3 DataFrame
 Row │ x           y          z
     │ Float64     Float64    Float64
─────┼──────────────────────────────────
   1 │ -0.0399147   0.456753   0.289904
   2 │  0.0796722   0.342714  -0.125813
   3 │ -0.201386    0.151664  -0.185074
   4 │  0.0908446   0.148882   0.450498
   5 │  0.294026    0.146691   0.32116
   6 │ -0.299414   -0.223979  -0.405456
   7 │  0.354147   -0.387514  -0.46584
   8 │  0.0662374  -0.433577   0.445775
   9 │ -0.253163   -0.443358  -0.37219
  10 │  0.266797   -0.489094   0.46467

Using order is useful if you would want to sort a data frame by several columns
and apply different ordering rules to them.

If you want to apply a transformation to a single column and sort it based on
the transformed values use the by option:

julia> sort(df, :y, by=abs)
10×3 DataFrame
 Row │ x           y          z
     │ Float64     Float64    Float64
─────┼──────────────────────────────────
   1 │  0.294026    0.146691   0.32116
   2 │  0.0908446   0.148882   0.450498
   3 │ -0.201386    0.151664  -0.185074
   4 │ -0.299414   -0.223979  -0.405456
   5 │  0.0796722   0.342714  -0.125813
   6 │  0.354147   -0.387514  -0.46584
   7 │  0.0662374  -0.433577   0.445775
   8 │ -0.253163   -0.443358  -0.37219
   9 │ -0.0399147   0.456753   0.289904
  10 │  0.266797   -0.489094   0.46467

or equivalently

julia> sort(df, order(:y, by=abs))
10×3 DataFrame
 Row │ x           y          z
     │ Float64     Float64    Float64
─────┼──────────────────────────────────
   1 │  0.294026    0.146691   0.32116
   2 │  0.0908446   0.148882   0.450498
   3 │ -0.201386    0.151664  -0.185074
   4 │ -0.299414   -0.223979  -0.405456
   5 │  0.0796722   0.342714  -0.125813
   6 │  0.354147   -0.387514  -0.46584
   7 │  0.0662374  -0.433577   0.445775
   8 │ -0.253163   -0.443358  -0.37219
   9 │ -0.0399147   0.456753   0.289904
  10 │  0.266797   -0.489094   0.46467

These patterns naturally extend to multiple columns, and sorting is performed
lexicographically. Here is an example:

julia> df2 = DataFrame(x=rand(Bool, 16), y=rand(Bool, 16), z=rand(Bool, 16))
16×3 DataFrame
 Row │ x      y      z
     │ Bool   Bool   Bool
─────┼─────────────────────
   1 │ false   true  false
   2 │  true   true   true
   3 │  true  false   true
   4 │ false  false   true
   5 │  true  false   true
   6 │  true  false  false
   7 │  true  false  false
   8 │ false  false  false
   9 │ false  false   true
  10 │ false   true   true
  11 │  true  false   true
  12 │ false  false  false
  13 │  true  false  false
  14 │  true  false  false
  15 │ false  false  false
  16 │ false   true  false

julia> sort(df2, [:y, order(:z, rev=true), :x])
16×3 DataFrame
 Row │ x      y      z
     │ Bool   Bool   Bool
─────┼─────────────────────
   1 │ false  false   true
   2 │ false  false   true
   3 │  true  false   true
   4 │  true  false   true
   5 │  true  false   true
   6 │ false  false  false
   7 │ false  false  false
   8 │ false  false  false
   9 │  true  false  false
  10 │  true  false  false
  11 │  true  false  false
  12 │  true  false  false
  13 │ false   true   true
  14 │  true   true   true
  15 │ false   true  false
  16 │ false   true  false

However, a question is what if I want to sort a data frame on a function of
multiple columns taken together?

Sorting on multiple columns considered jointly

Going back to our df data frame what if we wanted to sort it by the distance
from the origin?

In this case the sortperm function is useful. What you need to
do is to create a temporary object, get its sortperm, and apply it to a
source data frame. Here is how it is done in practice:

julia> df[sortperm(norm.(eachrow(df))), :]
10×3 DataFrame
 Row │ x           y          z
     │ Float64     Float64    Float64
─────┼──────────────────────────────────
   1 │ -0.201386    0.151664  -0.185074
   2 │  0.0796722   0.342714  -0.125813
   3 │  0.294026    0.146691   0.32116
   4 │  0.0908446   0.148882   0.450498
   5 │ -0.0399147   0.456753   0.289904
   6 │ -0.299414   -0.223979  -0.405456
   7 │  0.0662374  -0.433577   0.445775
   8 │ -0.253163   -0.443358  -0.37219
   9 │  0.354147   -0.387514  -0.46584
  10 │  0.266797   -0.489094   0.46467

A nice thing is that sortperm also works for data frames, so if you wanted to
sort the data frame by the sign of :x and then by the sum of :y and :z
columns you could write:

julia> df[sortperm(select(df, :x => ByRow(sign), [:y, :z] => +)), :]
10×3 DataFrame
 Row │ x           y          z
     │ Float64     Float64    Float64
─────┼──────────────────────────────────
   1 │ -0.253163   -0.443358  -0.37219
   2 │ -0.299414   -0.223979  -0.405456
   3 │ -0.201386    0.151664  -0.185074
   4 │ -0.0399147   0.456753   0.289904
   5 │  0.354147   -0.387514  -0.46584
   6 │  0.266797   -0.489094   0.46467
   7 │  0.0662374  -0.433577   0.445775
   8 │  0.0796722   0.342714  -0.125813
   9 │  0.294026    0.146691   0.32116
  10 │  0.0908446   0.148882   0.450498

Conclusion

The thing to remember is that because data frame fully supports standard
indexing like a matrix you can easily reorder it using the sortperm function
applied to an object different than the original data frame.

However, since this feature request is raised quite often we are currently
discussing how to add a support to it in a standard sort syntax. If
you are interested in the details you can check this issue.