How to make your joins faster in DataFrames.jl?

By: Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/07/30/joins.html

Introduction

Joining tables is one of the fundamental operations when wrangling data.
Therefore people using DataFrames.jl would understandably want them
to be as fast as possible.

In this post I will show you some tricks that can make joins faster when they
are a bottleneck, which typically happens when you have many Strings in your
data. This scenario is relevant as it is quite common in data science workflows.

This post was written under Julia 1.6.1, DataFrames.jl 1.2.1, BenchmarkTools.jl
1.1.1, and WeakRefStrings.jl 1.1.0. I am running the tests on a laptop
under Linux and with 16GB of RAM using a single thread.

In the timings below I report the memory footprint of the julia process in
each scenario. This is relevant because, when we are very close to the
available memory limit, Julia might struggle with memory management. I have
chosen the size of the tests so that they easily fit into memory (but of course
they are large enough to cause issues).

The baseline

In the baseline scenario I join tables that do not contain any strings.

Here is my benchmark:

julia> using DataFrames

julia> using BenchmarkTools

julia> df1 = DataFrame(id=1:5*10^7, left=1:5*10^7)
50000000×2 DataFrame
      Row │ id        left
          │ Int64     Int64
──────────┼────────────────────
        1 │        1         1
        2 │        2         2
    ⋮     │    ⋮         ⋮
 49999999 │ 49999999  49999999
 50000000 │ 50000000  50000000
          49999996 rows omitted

julia> df2 = DataFrame(id=1:5*10^7, right=1:5*10^7)
50000000×2 DataFrame
      Row │ id        right
          │ Int64     Int64
──────────┼────────────────────
        1 │        1         1
        2 │        2         2
    ⋮     │    ⋮         ⋮
 49999999 │ 49999999  49999999
 50000000 │ 50000000  50000000
          49999996 rows omitted

julia> GC.gc() # julia process memory footprint: 1.7GB

julia> @benchmark innerjoin($df1, $df2, on=:id) seconds=60
BenchmarkTools.Trial: 28 samples with 1 evaluation.
 Range (min … max):  1.963 s …   2.255 s  ┊ GC (min … max): 0.02% … 9.21%
 Time  (median):     2.190 s              ┊ GC (median):    9.49%
 Time  (mean ± σ):   2.188 s ± 47.369 ms  ┊ GC (mean ± σ):  9.12% ± 1.79%

                                             █ ▆
  ▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁███▅▄▁▁▁▁▁▁▁▁▅ ▁
  1.96 s         Histogram: frequency by time        2.26 s <

 Memory estimate: 1.86 GiB, allocs estimate: 198.

As you can see the baseline timing is around 2 seconds. The GC
(garbage collector) gets triggered in the benchmarks but it does not take more
than 10% of total run time.

This gives us a baseline for our tests.

The issue of many small allocated objects

Now consider a scenario where the :left column in df1 and the :right column in
df2 instead contain elements of type String. Note that this will not affect
the joining process itself, as I do not touch the column on which we perform the join.

Here is what we get:

julia> df1.left = string.(df1.left);

julia> df2.right = string.(df2.right);

julia> GC.gc() # julia process memory footprint: 4.7GB

julia> @benchmark innerjoin($df1, $df2, on=:id) seconds=60
BenchmarkTools.Trial: 9 samples with 1 evaluation.
 Range (min … max):  4.670 s … 8.088 s  ┊ GC (min … max): 54.57% … 73.43%
 Time  (median):     7.764 s            ┊ GC (median):    73.47%
 Time  (mean ± σ):   7.457 s ± 1.051 s  ┊ GC (mean ± σ):  72.17% ±  6.31%

                                                   █
  ▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▄ ▁
  4.67 s        Histogram: frequency by time       8.09 s <

 Memory estimate: 1.86 GiB, allocs estimate: 198.

As you can see we are now roughly 4 times slower on average. In this case GC
typically takes a bit over 70% of the total run time of the join operation. We
observe this issue even though the memory footprint of the julia process is
4.7GB, which is still well below the memory available on my machine.

The issue is that the GC, when triggered, takes a very long time because it has
to traverse very many small allocated objects (Strings in our case).
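To see the scale of the problem, compare how many heap objects the GC must track in the two cases. Below is a minimal sketch (the element count is chosen arbitrarily for illustration, much smaller than in the benchmark):

```julia
# A Vector{Int} is essentially a single heap allocation: Int is an
# isbits type, so the values are stored inline in the vector's buffer
# and a GC sweep sees roughly one object.
ints = collect(1:10^6)

# string.(ints) instead creates ~10^6 separate small heap objects,
# every one of which a full GC sweep has to visit.
strs = string.(ints)

@show isbitstype(eltype(ints))  # true: stored inline, cheap for the GC
@show isbitstype(eltype(strs))  # false: each String is tracked separately
```

This is why the memory estimate reported by BenchmarkTools is the same in both runs, yet the GC share of the run time differs so much: the cost here is in traversal, not in extra allocation by the join itself.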

We try inlining our strings

Aware of this issue, @quinnj has implemented a set of InlineString*
types in the WeakRefStrings.jl package (and soon CSV.jl will use
them by default).

Let us check if using them resolves our issues:

julia> using WeakRefStrings

julia> df1.left = InlineString15.(df1.left);

julia> df2.right = InlineString15.(df2.right);

julia> GC.gc()  # julia process memory footprint: 2.5GB

julia> @benchmark innerjoin($df1, $df2, on=:id) seconds=60
BenchmarkTools.Trial: 24 samples with 1 evaluation.
 Range (min … max):  2.258 s …   2.544 s  ┊ GC (min … max):  0.03% … 10.36%
 Time  (median):     2.534 s              ┊ GC (median):    10.27%
 Time  (mean ± σ):   2.521 s ± 56.830 ms  ┊ GC (mean ± σ):   9.83% ±  2.11%

                                                       ▁▃█▃
  ▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁████ ▁
  2.26 s         Histogram: frequency by time        2.54 s <

 Memory estimate: 2.61 GiB, allocs estimate: 198.

Wow! We are almost back to the timings we originally had with the integer
columns. The memory footprint and the timings are a bit larger than in the
baseline scenario because the width of the strings in our case requires us to
use the InlineString15 type.
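The 15 in InlineString15 is the maximum number of bytes of content the type can hold; WeakRefStrings.jl also provides narrower and wider variants (InlineString1, 3, 7, 31, and so on). A quick sketch of why InlineString15 is the right width for our keys:

```julia
using WeakRefStrings

# The longest value in our benchmark columns is "50000000" -- 8 bytes,
# so InlineString15 (up to 15 bytes of content) is wide enough here,
# while InlineString7 would be too narrow.
@show sizeof(string(5 * 10^7))

# InlineString15 is an isbits type: the bytes live inline in the column's
# buffer instead of as per-element heap objects the GC must traverse.
s = InlineString15("50000000")
@show isbitstype(typeof(s))
```

Picking the narrowest type that fits your data keeps the column buffer small; using a wider variant than needed still avoids the GC problem but wastes memory per element.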

Does using Symbol solve the issue?

One alternative practice people use to work around the issues with
Strings is to use Symbols instead. Since Symbols are interned they are
faster for certain operations. This is not a free lunch though: in particular,
the memory used by Symbols cannot be reclaimed when they are no longer referenced.
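Interning means that constructing the same Symbol twice yields the very same object, so comparisons reduce to pointer checks. A minimal illustration:

```julia
# Two independent constructions of the same Symbol...
a = Symbol("id_", 1)
b = Symbol("id_", 1)

# ...return the same interned object, so `===` (object identity) holds,
# which makes equality tests and hashing very cheap.
@show a === b

# The flip side: the intern table only grows. The memory behind a Symbol
# is never reclaimed, even once no code references it anymore.
```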

Let us check if using Symbol instead of String helps in our case:

julia> df1.left = Symbol.(df1.left);

julia> df2.right = Symbol.(df2.right);

julia> GC.gc() # julia process memory footprint: 4.0GB

julia> @benchmark innerjoin($df1, $df2, on=:id) seconds=60
BenchmarkTools.Trial: 16 samples with 1 evaluation.
 Range (min … max):  1.879 s …    4.831 s  ┊ GC (min … max):  0.04% … 59.90%
 Time  (median):     3.940 s               ┊ GC (median):    51.65%
 Time  (mean ± σ):   3.868 s ± 574.918 ms  ┊ GC (mean ± σ):  50.72% ± 13.20%

                                          █
  ▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
  1.88 s         Histogram: frequency by time         4.83 s <

 Memory estimate: 1.86 GiB, allocs estimate: 198.

As you can see the benchmarks are better than for String. However, GC time
still typically takes around 50% of execution time. This shows that although
Symbols are interned they are still tracked by the GC (which in theory could be
avoided, since we know that the memory they occupy cannot be reclaimed).

Conclusions

Here is a summary of what we have learned:

  • using String type can lead to significant issues with GC;
  • using Symbol instead is somewhat better, but does not resolve the GC issues
    in full;
  • using InlineString15 gave very good results.

One just has to remember that InlineString* types are limited and can hold
only small strings. Fortunately, in typical scenarios where we have a lot of
strings they are not very long.
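Concretely, constructing an InlineString15 from a string longer than 15 bytes fails (a sketch; the exact error type may differ between package versions):

```julia
using WeakRefStrings

InlineString15("short")  # fine: 5 bytes of content

# Anything over 15 bytes cannot be represented by this type and errors,
# so pick the narrowest InlineString* variant that fits your data.
try
    InlineString15("definitely longer than fifteen bytes")
catch err
    @show typeof(err)
end
```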

Also note that I used the String type as an example in these tests because it
is common in data science workflows. However, one would run into similar issues
with many instances of any object that has to be tracked by the GC. In fact, it
does not even matter whether they are processed by e.g. the join operation we
used in our examples. What matters is that they reside in memory and thus need
to be tracked by the GC.

Let me also highlight that these issues would be even more visible if I used
multiple threads. The reason is that currently the GC in Julia does not handle
this scenario as efficiently as it potentially could. This is an issue that the
Julia core devs plan to resolve, as we could learn during this talk given at
JuliaCon 2021.

I believe that in the future it will be possible to make everything efficient
out of the box. However, as long as the GC issues with tracking many small
objects are present, it is good to know how they can be worked around.