Re-posted from: https://bkamins.github.io/julialang/2021/07/30/joins.html
Introduction
Joining tables is one of the fundamental operations when wrangling data.
Therefore people using DataFrames.jl would understandably want them
to be as fast as possible.
In this post I will show you some tricks that can make joins faster in a case
that is currently a bottleneck: when you have many String values in your data.
This scenario is relevant as it is quite common in data science workflows.
This post was written under Julia 1.6.1, DataFrames.jl 1.2.1, BenchmarkTools.jl
1.1.1, and WeakRefStrings.jl 1.1.0. I am running the tests on a laptop running
Linux with 16GB of RAM, using a single thread.
In the timings below I report the memory footprint of the julia process in the
respective scenarios. This is relevant because, if we were very close to the
available memory limit, Julia might struggle with memory management. I have
chosen the size of the tests so that they easily fit into memory (but they are
still large enough to expose the issues).
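The post does not specify how the process footprint was measured. As a rough sketch, on Linux you can read the current resident set size of the julia process from /proc; the helper below is my own addition, not part of the original benchmark:
julia> function rss_gib()  # my own helper (Linux only): current resident memory of this julia process, in GiB
           line = only(filter(startswith("VmRSS"), readlines("/proc/self/status")))
           kb = parse(Int, split(line)[2])  # /proc/self/status reports VmRSS in kB (KiB)
           return kb / 2^20
       end
rss_gib (generic function with 1 method)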
The baseline
In the baseline scenario I join tables that do not contain any strings.
Here is my benchmark:
julia> using DataFrames
julia> using BenchmarkTools
julia> df1 = DataFrame(id=1:5*10^7, left=1:5*10^7)
50000000×2 DataFrame
Row │ id left
│ Int64 Int64
──────────┼────────────────────
1 │ 1 1
2 │ 2 2
⋮ │ ⋮ ⋮
49999999 │ 49999999 49999999
50000000 │ 50000000 50000000
49999996 rows omitted
julia> df2 = DataFrame(id=1:5*10^7, right=1:5*10^7)
50000000×2 DataFrame
Row │ id right
│ Int64 Int64
──────────┼────────────────────
1 │ 1 1
2 │ 2 2
⋮ │ ⋮ ⋮
49999999 │ 49999999 49999999
50000000 │ 50000000 50000000
49999996 rows omitted
julia> GC.gc() # julia process memory footprint: 1.7GB
julia> @benchmark innerjoin($df1, $df2, on=:id) seconds=60
BenchmarkTools.Trial: 28 samples with 1 evaluation.
Range (min … max): 1.963 s … 2.255 s ┊ GC (min … max): 0.02% … 9.21%
Time (median): 2.190 s ┊ GC (median): 9.49%
Time (mean ± σ): 2.188 s ± 47.369 ms ┊ GC (mean ± σ): 9.12% ± 1.79%
█ ▆
▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁███▅▄▁▁▁▁▁▁▁▁▅ ▁
1.96 s Histogram: frequency by time 2.26 s <
Memory estimate: 1.86 GiB, allocs estimate: 198.
As you can see, the baseline timing is around 2 seconds. The GC
(garbage collector) gets triggered during the benchmarks, but it does not take
more than 10% of the total run time.
This gives us a baseline for our tests.
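If you do not want to run a full BenchmarkTools suite, note that a single @timed call also reports GC time, so the GC share of one join can be computed directly. A minimal sketch of mine (not from the original post):
julia> stats = @timed innerjoin(df1, df2, on=:id);

julia> stats.gctime / stats.time  # fraction of the run spent in GC; should roughly match the GC % above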
The issue of many small allocated objects
Now consider that the :left column in df1 and the :right column in df2 instead
contain elements of type String. Note that this does not affect the join logic
itself, as I do not touch the column on which we perform the join.
Here is what we get:
julia> df1.left = string.(df1.left);
julia> df2.right = string.(df2.right);
julia> GC.gc() # julia process memory footprint: 4.7GB
julia> @benchmark innerjoin($df1, $df2, on=:id) seconds=60
BenchmarkTools.Trial: 9 samples with 1 evaluation.
Range (min … max): 4.670 s … 8.088 s ┊ GC (min … max): 54.57% … 73.43%
Time (median): 7.764 s ┊ GC (median): 73.47%
Time (mean ± σ): 7.457 s ± 1.051 s ┊ GC (mean ± σ): 72.17% ± 6.31%
█
▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▄ ▁
4.67 s Histogram: frequency by time 8.09 s <
Memory estimate: 1.86 GiB, allocs estimate: 198.
As you can see, we are now roughly 3.5 times slower on average. In this case GC
typically takes a bit over 70% of the total run time of the join operation. We
observe this issue even though the memory footprint of the julia process is
4.7GB, which is still well below the memory available on my machine.
The issue is that the GC, when triggered, takes a very long time because it has
to traverse very many small allocated objects (the Strings in our case).
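You can observe the same effect outside of the join: a full collection gets slow simply because millions of small heap-allocated objects are alive. A minimal sketch of mine (not part of the original benchmark):
julia> many_small = string.(1:10^7);  # ten million small, individually heap-allocated Strings

julia> @time GC.gc()  # a full collection now has to visit every one of these objects

julia> many_small = nothing; GC.gc()  # drop the reference so the remaining benchmarks are unaffected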
We try inlining our strings
Being aware of this issue, @quinnj has implemented a set of InlineString*
types in the WeakRefStrings.jl package (and soon CSV.jl will use them by
default).
Let us check if using them resolves our issues:
julia> using WeakRefStrings
julia> df1.left = InlineString15.(df1.left);
julia> df2.right = InlineString15.(df2.right);
julia> GC.gc() # julia process memory footprint: 2.5GB
julia> @benchmark innerjoin($df1, $df2, on=:id) seconds=60
BenchmarkTools.Trial: 24 samples with 1 evaluation.
Range (min … max): 2.258 s … 2.544 s ┊ GC (min … max): 0.03% … 10.36%
Time (median): 2.534 s ┊ GC (median): 10.27%
Time (mean ± σ): 2.521 s ± 56.830 ms ┊ GC (mean ± σ): 9.83% ± 2.11%
▁▃█▃
▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁████ ▁
2.26 s Histogram: frequency by time 2.54 s <
Memory estimate: 2.61 GiB, allocs estimate: 198.
Wow! We are almost back to the timings we had originally with the integer
columns. The memory footprint and the timings are a bit bigger than in the
baseline scenario because the width of our strings requires the InlineString15
type, which takes 16 bytes per element instead of the 8 bytes of an Int64.
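The reason InlineString15 helps is that it is a fixed-size bits type, so a vector of such strings is one contiguous block of memory that the GC does not have to traverse element by element. A quick check (my own addition):
julia> isbitstype(InlineString15)  # stored inline in arrays, no per-element heap object
true

julia> sizeof(InlineString15)      # 15 bytes of payload plus 1 byte storing the length
16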
Does using Symbol solve the issue?
One of the alternative practices that people use to work around the issues with
Strings is to use Symbols instead. Since Symbols are interned, they are faster
for certain operations. This is not a free lunch though as, in particular, the
memory used by them cannot be reclaimed when they are no longer referenced.
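Interning means that constructing the same symbol twice yields the very same object, so comparisons reduce to a pointer check. A small illustration (my own, not from the original post):
julia> Symbol("user_", 42) === Symbol("user_", 42)  # both calls return the same interned object
true

julia> :user_42 === Symbol("user_", 42)
true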
Let us check if using Symbol instead of String helps in our case:
julia> df1.left = Symbol.(df1.left);
julia> df2.right = Symbol.(df2.right);
julia> GC.gc() # julia process memory footprint: 4.0GB
julia> @benchmark innerjoin($df1, $df2, on=:id) seconds=60
BenchmarkTools.Trial: 16 samples with 1 evaluation.
Range (min … max): 1.879 s … 4.831 s ┊ GC (min … max): 0.04% … 59.90%
Time (median): 3.940 s ┊ GC (median): 51.65%
Time (mean ± σ): 3.868 s ± 574.918 ms ┊ GC (mean ± σ): 50.72% ± 13.20%
█
▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
1.88 s Histogram: frequency by time 4.83 s <
Memory estimate: 1.86 GiB, allocs estimate: 198.
As you can see, the benchmarks are better than for String. However, GC time
still typically takes around 50% of the execution time. This shows that although
Symbols are interned, they are tracked by the GC (which in theory could be
avoided, since we know that the memory they occupy cannot be reclaimed).
Conclusions
Here is a summary of what we have learned:
- using String type can lead to significant issues with GC;
- using Symbol instead is somewhat better, but does not resolve the GC issues in full;
- using InlineString15 gave very good results.
One just has to remember that InlineString* types are limited and can hold only
small strings. Fortunately, in typical scenarios where we have a lot of strings
they are not very long.
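Before converting a column it is worth checking that every value actually fits in the chosen width; the limit is expressed in bytes, so ncodeunits is the right measure. A sketch of mine (not from the original post):
julia> maximum(ncodeunits, string.(df1.id)) <= 15  # every value fits in InlineString15 (at most 15 bytes)
true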
Also note that I used the String type as an example in these tests because it is
common in data science workflows. However, one would run into similar issues
with any large number of objects that have to be tracked by the GC. In fact, it
does not even matter whether they are processed by e.g. the join operation that
we used in our examples. What matters is that they reside in memory and thus
need to be tracked by the GC.
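You can convince yourself of this by keeping a completely unrelated collection of small objects alive while benchmarking an integer-only join: the GC share should grow even though the join never touches that data. A sketch of mine, not a benchmark from the post:
julia> a = DataFrame(id=1:10^7, x=1:10^7);  # integer-only tables, like the baseline but smaller

julia> b = DataFrame(id=1:10^7, y=1:10^7);

julia> unrelated = [string(i) for i in 1:10^7];  # never touched by the join, yet alive and tracked by GC

julia> @benchmark innerjoin($a, $b, on=:id) seconds=60  # expect a larger GC share than without unrelated alive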
Let me also highlight that the issues would be even more visible if I used
multiple threads. The reason is that currently the GC in Julia does not handle
this scenario as efficiently as it potentially could. This is an issue that the
Julia core developers plan to address, as we could learn during this talk given
at JuliaCon 2021.
I believe that in the future it will be possible to make everything efficient
out-of-the-box. However, as long as the GC issues with tracking many small
objects are present, it is good to know how they can be worked around.