Re-posted from: https://bkamins.github.io/julialang/2021/12/31/leftjoin.html
Introduction
This post continues the presentation of new features added in DataFrames.jl 1.3. This time I will discuss the leftjoin!
function.
The post was written under Julia 1.7.0 and DataFrames.jl 1.3.1.
For performance comparison I have used R 4.1.2 and data.table 0.14.2.
Both in R and Julia I run the computations on 4 threads.
An in place left join
Before release 1.3, DataFrames.jl already offered a rich set of efficient
join functions:
innerjoin
, leftjoin
, rightjoin
, outerjoin
, semijoin
, antijoin
,
and crossjoin
.
However, they all have a common limitation: their result is a freshly allocated
data frame.
A common usage scenario in practice is that we would like to add some
new columns to an existing table in-place. This is more efficient and uses less
memory (which is relevant if we work with very large data frames).
Since DataFrames.jl 1.3 this option is available with the addition of the
leftjoin!
function. if you run leftjoin!(df1, df2; on=...)
then the contract
is that the df1
data frame is updated in-place with columns coming from df2
based on matching rows of both data frames using the columns passed in the on
keyword argument.
It is important to remember that the design of leftjoin!
assumes that the
columns of df1
are left unchanged (this is crucial for performance of the
operation). However, this implies that each row in df1
must have at most one
match in df2
. Otherwise, leftjoin!
would not be able to execute the
operation in-place since new rows would need to be added to df1. If you have
matching duplicate rows in df2
then just use leftjoin
.
Here are two minimal examples of leftjoin!
.
julia> using DataFrames
julia> using Random
julia> df1 = DataFrame(a=1:6, b=1:6)
6×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
4 │ 4 4
5 │ 5 5
6 │ 6 6
julia> df2 = DataFrame(a=[2, 4, 6], c=1:3)
3×2 DataFrame
Row │ a c
│ Int64 Int64
─────┼──────────────
1 │ 2 1
2 │ 4 2
3 │ 6 3
julia> leftjoin!(df1, df2, on=:a)
6×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64?
─────┼───────────────────────
1 │ 1 1 missing
2 │ 2 2 1
3 │ 3 3 missing
4 │ 4 4 2
5 │ 5 5 missing
6 │ 6 6 3
julia> Random.seed!(1234);
julia> df1 = DataFrame(a=randperm(6), b=1:6)
6×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 3 1
2 │ 2 2
3 │ 6 3
4 │ 5 4
5 │ 1 5
6 │ 4 6
julia> df2 = DataFrame(a=shuffle!([2, 4, 6]), c=1:3)
3×2 DataFrame
Row │ a c
│ Int64 Int64
─────┼──────────────
1 │ 4 1
2 │ 2 2
3 │ 6 3
julia> leftjoin!(df1, df2, on=:a)
6×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64?
─────┼───────────────────────
1 │ 3 1 missing
2 │ 2 2 2
3 │ 6 3 3
4 │ 5 4 missing
5 │ 1 5 missing
6 │ 4 6 1
Performance benchmarks
Now let me run two performance benchmarks of DataFrames.jl against data.table
.
In the benchmarks I use 32-bit integers to ensure comparability of memory
footprint of objects between R and Julia.
The first test is on sorted key column. We start with Julia:
julia> df1 = DataFrame(a=Int32.(1:10^8));
julia> df2 = DataFrame(a=Int32.(1:10^8), x = true);
julia> @time leftjoin!(df1, df2, on=:a);
2.867632 seconds (221.45 k allocations: 2.433 GiB, 6.47% gc time, 3.98% compilation time)
julia> df1 = DataFrame(a=Int32.(1:10^8));
julia> @time leftjoin!(df1, df2, on=:a);
2.934633 seconds (150 allocations: 2.421 GiB, 8.19% gc time)
And now the data.table
:
> library(data.table)
> df1 = data.table(a=1:10^8)
> df2 = data.table(a=1:10^8, x=TRUE)
> system.time(df1[df2, on = 'a', x := i.x])
user system elapsed
8.067 1.106 5.679
> df1 = data.table(a=1:10^8)
> system.time(df1[df2, on = 'a', x := i.x])
user system elapsed
9.305 1.184 6.652
As you can see for sorted data DataFrames.jl timings are competitive. Let us
now check shuffled data.
We start with DataFrames.jl:
julia> df1 = DataFrame(a=shuffle!(Int32.(1:10^8)));
julia> df2 = DataFrame(a=shuffle!(Int32.(1:10^8)), x = true);
julia> @time leftjoin!(df1, df2, on=:a);
23.881552 seconds (175 allocations: 3.167 GiB, 1.43% gc time)
julia> df1 = DataFrame(a=Int32.(1:10^8));
julia> @time leftjoin!(df1, df2, on=:a);
18.909113 seconds (175 allocations: 3.167 GiB, 1.40% gc time)
and now the timing for data.table
:
> df1 = data.table(a=sample(1:10^8))
> df2 = data.table(a=sample(1:10^8), x=TRUE)
> system.time(df1[df2, on = 'a', x := i.x])
user system elapsed
30.778 1.791 23.153
> df1 = data.table(a=sample(1:10^8))
> system.time(df1[df2, on = 'a', x := i.x])
user system elapsed
30.586 1.695 22.893
Again the timing of DataFrames.jl is competitive.
(let me stress here that this is just one set of examples and relative
performance of different packages can vary depending on the data and the
operating environment; the point of these tests is to show that currently
DataFrmes.jl is not much worse than data.table
in execution of joins, as this
was a performance bottleneck of DataFrames.jl in the past)
Conclusions
This time in the post I have focused on a single function: the leftjoin!
.
The reason is that I believe that addition of an in-place left join to
DataFrames.jl is quite significant since it is needed in many data processing
scenarios, especially when working with large tables.