The order of join and grouping result in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/04/30/roworder.html

Introduction

Today I want to focus on an issue that is often not noticed by users when
working DataFrames.jl, but in some cases it it might be relevant.

The subject is the order of join and grouping operations result in
DataFrames.jl. The key point of the post is that this order depends on several
factors, so it is simplest to assume that it is undefined.
I am not going to list all cases in my examples, but just focus on showing
the fact as in the future the order might change.

In this post I am using Julia 1.6.1 and DataFrames.jl 1.0.1.

Joins

Consider the following example of innerjoin:

julia> using DataFrames

julia> df1 = DataFrame(x=[2, 3, 1, 4], id1=1:4)
4×2 DataFrame
 Row │ x      id1
     │ Int64  Int64
─────┼──────────────
   1 │     2      1
   2 │     3      2
   3 │     1      3
   4 │     4      4

julia> df2 = DataFrame(x=[1, 3, 2, 5, 6], id2=1:5)
5×2 DataFrame
 Row │ x      id2
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     3      2
   3 │     2      3
   4 │     5      4
   5 │     6      5

julia> innerjoin(df1, df2, on=:x)
3×3 DataFrame
 Row │ x      id1    id2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      3      1
   2 │     3      2      2
   3 │     2      1      3

julia> innerjoin(df2, df1, on=:x)
3×3 DataFrame
 Row │ x      id2    id1
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      3
   2 │     3      2      2
   3 │     2      3      1

As you can see currently the row order in the result of innerjoin is taken
from the longer table.

Now consider outerjoin (similar results are for leftjoin and rightjoin):

julia> outerjoin(df1, df2, on=:x)
6×3 DataFrame
 Row │ x      id1      id2
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     1        3        1
   2 │     3        2        2
   3 │     2        1        3
   4 │     4        4  missing
   5 │     5  missing        4
   6 │     6  missing        5

julia> outerjoin(df2, df1, on=:x)
6×3 DataFrame
 Row │ x      id2      id1
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     1        1        3
   2 │     3        2        2
   3 │     2        3        1
   4 │     5        4  missing
   5 │     6        5  missing
   6 │     4  missing        4

Now we have the following parts of the table: first comes the chunk
matching what innerjoin produces, then we have a non-matching part from the
left table, and finally we have a non-matching part from the right table.

While before 1.0 release we did not guarantee the row order in joins, the
actual order has changed in DataFrames.jl 1.0. The reason were performance
considerations. Consider the following examples of joins and their timing:

julia> df1 = DataFrame(x=string.(1:10^7));

julia> df2 = DataFrame(x=string.(1:10));

julia> @time innerjoin(df1, df2, on=:x);
  0.246627 seconds (176 allocations: 13.797 KiB)

julia> @time innerjoin(df2, df1, on=:x);
  0.237981 seconds (175 allocations: 13.781 KiB)

(I am showing you the timings after compilationp; I use Vector{String} to join
on as this case is the slowest scenario under DataFrames.jl 1.0).

Now switch to DataFrames.jl 0.22.7 for a while (you need a fresh session and a
fresh project environment to test this; timings are again after compilation):

julia> df1 = DataFrame(x=string.(1:10^7));

julia> df2 = DataFrame(x=string.(1:10));

julia> @time innerjoin(df1, df2, on=:x);
  0.350317 seconds (177 allocations: 152.602 MiB)

julia> @time innerjoin(df2, df1, on=:x);
  1.140921 seconds (183 allocations: 662.071 MiB)

As you can see the current algorithm not only uses much less memory, but also
it is faster in general and not affected by the argument order (the last thing
was a major bane of joins before 1.0 release of DataFrames.jl).

For a reference check what data.table in R offers in this case in terms of
performance (I am adding it as the performance against data.table is a hot
topic recently):

> library(data.table)
data.table 1.14.0 using 4 threads (see ?getDTthreads).  Latest news: r-datatable.com
> dt1 <- data.table(x=as.character(1:10^7))
> dt2 <- data.table(x=as.character(1:10))
> system.time(merge(dt1, dt2, all=FALSE))
   user  system elapsed
  7.445   0.153   3.544
> system.time(merge(dt1, dt2, all=FALSE, sort=FALSE))
   user  system elapsed
  6.735   0.128   2.827

(note that I have used non-pooled vectors in both cases, as this was the scenario
that allowed me to compare DataFrames.jl 1.0 and 0.22.7 best; clearly if we
joined on pooled vectors the timings would be much better)

Grouping

In groupby operation the rules of ordering of the GroupedDataFrame object
depend on the type of the column you join on (I am assuming you are not passing
sort=true keyword argument, as then groups are sorted). The two cases are:

  • if you join on columns that are pooled (like PooledVector or
    CategoricalVector) and the number of possible groups is not huge
    then you get your result in the order of levels in the pool;
  • otherwise the group ordering is their order of appearance in the source vector.

Here a particular corner case are integer columns, which are treated to be
pooled (so the groups are sorted), unless the range of the integers is huge
(as then we fall back to the order of appearance). Here is an example:

julia> df = DataFrame(x=[3, 1, 2], y=[300, 1, 2])
3×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     3    300
   2 │     1      1
   3 │     2      2

julia> keys(groupby(df, :x))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (x = 1,)
 GroupKey: (x = 2,)
 GroupKey: (x = 3,)

julia> keys(groupby(df, :y))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (y = 300,)
 GroupKey: (y = 1,)
 GroupKey: (y = 2,)

What is considered to be huge is left undefined (as it might change in the
future), but in general if there is less levels than the number of rows in of
the data frame (this is a typical case in practice) then we do not consider
it as a huge number.

Again – let us do some benchmarking against the 0.22.7 release of DataFrames.jl.
First the results for DataFrames.jl 1.0:

julia> df = DataFrame(x=1:10^7+1, y=[1:10^7; 10^10]);

julia> @time groupby(df, :x);
  0.055407 seconds (64 allocations: 85.834 MiB)

julia> @time groupby(df, :y);
  0.895716 seconds (50 allocations: 280.591 MiB)

and now under 0.22.7 release:

julia> df = DataFrame(x=1:10^7+1, y=[1:10^7; 10^10]);

julia> @time groupby(df, :x);
  0.890674 seconds (31 allocations: 280.590 MiB)

julia> @time groupby(df, :y);
  0.884177 seconds (31 allocations: 280.590 MiB)

As you can see, in the case of grouping integer columns we are much faster
than before if the integer range is not huge.

Let us have a comparison with data.table again (we need to also perform some
aggregation to match apples to apples in terms of timing).

First DataFrames.jl 1.0:

julia> df = DataFrame(x=1:10^7+1, y=[1:10^7; 10^10]);

julia> @time combine(groupby(df, :x), nrow);
  0.180592 seconds (261 allocations: 324.266 MiB)

julia> @time combine(groupby(df, :y), nrow);
  1.006619 seconds (247 allocations: 519.023 MiB)
> df <- data.table(x=1:(10^7+1), y=c(1:10^7, 10^10))
> system.time(df[, .N, by = x])
   user  system elapsed
  0.644   0.088   0.266
> system.time(df[, .N, by = y])
   user  system elapsed
  0.991   0.096   0.404

This time for the huge range DataFrames.jl is slower. (note that data.table
is using four threads – which is great – and I tested my code on a single thread
in Julia, as in DataFrames.jl we do not support multi-threading in this
particular case yet)

Conclusions

In summary: although there are precise rules that determine the order of join
and grouping results is simplest to assume that it is undefined (like in data
bases). The reason for this are operation performance considerations (so the
rules are complex and might change in the future).

However, based on the user feedback, we might in the future consider adding some
keyword arguments to joins or groupby that would guarantee some particular
order. Therefore if you have any thoughts on it please open an issue in
DataFrames.jl repository on GitHub.