Re-posted from: https://bkamins.github.io/julialang/2022/12/23/duckdb.html
Introduction
I usually write in this blog about DataFrames.jl. This time I decided
to switch my mode to good old SQL queries.
This post is a follow-up to the talk
“Data manipulation in Julia for pandas users” I gave last week during
Data Science Summit 2022. In that talk I explained how
DataFrames.jl compares to pandas.
After the talk, I thought I would check how the comparison would look against
DuckDB, which is an excellent in-process SQL OLAP database management
system.
In this post I show how a sample analytical scenario can be executed in
DataFrames.jl and DuckDB, and how well the two are integrated.
The code was run under Julia 1.8.2, CSV.jl 0.10.8, DataFrames.jl 1.4.4, and
DuckDB.jl 0.6.0 on a laptop with 32GB of RAM. I perform tests using 8 cores.
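As a side note, one way to make CSV.jl and DataFrames.jl use 8 threads is to start Julia with the -t command line flag (DuckDB manages its own thread pool independently of Julia); a quick check of the session:
julia> Threads.nthreads()  # Julia started with the -t 8 flag
8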
The task
When I was preparing the “Data manipulation in Julia for pandas users” talk
I wanted to choose some data set that is not too big (so that things are
quick) nor too small (so that they are not too quick).
So I went to Kaggle and searched for datasets with 5 to 10 GB of data, stored in
CSV files, and consisting of a few (but not too many) files. I ordered the results
by hotness, and at the top of the list was an Instagram dataset consisting of three tables:
- instagram_posts.csv: 42M rows of individual post information;
- instagram_locations.csv: 1.2M rows of location information;
- instagram_profiles.csv: 4.5M rows of user profile information.
You can find the dataset here.
For each post we can match a profile of a person who made it and a location
where it was made. Therefore I decided to do the following operations as a test:
- read the data from disk to RAM;
- drop all rows with missing data in posts;
- drop all rows with missing data in profiles, and in case of duplicate profiles
  keep only one of them (see the toy example after this list);
- drop all rows with missing data in locations;
- left join posts and profiles;
- left join posts and locations;
- compute some aggregate statistics of posts data by country.
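For the profile-deduplication step, both implementations below resolve duplicates by keeping the maximum n_posts value per profile_id. A toy illustration of this rule (the data frame and its values are made up for the example):
julia> using DataFrames

julia> df = DataFrame(profile_id=[1, 1, 2], n_posts=[10, 20, 5]);

julia> combine(groupby(df, :profile_id), :n_posts => maximum => :n_posts)
2×2 DataFrame
 Row │ profile_id  n_posts
     │ Int64       Int64
─────┼─────────────────────
   1 │          1       20
   2 │          2        5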
Below I show how these tasks can be performed in DataFrames.jl and
DuckDB.
(As a side note: doing this “as is” in pandas was not possible on my
laptop, as I was running out of RAM and some of the steps took too long;
therefore in my talk during the conference I presented a slightly different
scenario.)
Initial inspection of the data
Let us start by peeking into the data. First, do it in Julia:
julia> using DataFrames
julia> using CSV
julia> using Statistics
julia> describe(CSV.read("instagram_posts.csv", DataFrame, limit=100), :eltype)
10×2 DataFrame
Row │ variable eltype
│ Symbol Type
─────┼─────────────────────────────────────────
1 │ sid Int64
2 │ sid_profile Int64
3 │ post_id String15
4 │ profile_id Int64
5 │ location_id Int64
6 │ cts String31
7 │ post_type Int64
8 │ description Union{Missing, String}
9 │ numbr_likes Int64
10 │ number_comments Int64
julia> describe(CSV.read("instagram_profiles.csv", DataFrame, limit=100), :eltype)
11×2 DataFrame
Row │ variable eltype
│ Symbol Type
─────┼─────────────────────────────────────────────
1 │ sid Int64
2 │ profile_id Int64
3 │ profile_name String31
4 │ firstname_lastname String31
5 │ description Union{Missing, String}
6 │ following Union{Missing, Int64}
7 │ followers Union{Missing, Int64}
8 │ n_posts Union{Missing, Int64}
9 │ url Union{Missing, String}
10 │ cts String31
11 │ is_business_account Union{Missing, Bool}
julia> describe(CSV.read("instagram_locations.csv", DataFrame, limit=100), :eltype)
23×2 DataFrame
Row │ variable eltype
│ Symbol Type
─────┼──────────────────────────────────────────────────
1 │ sid Int64
2 │ id Int64
3 │ name String
4 │ street Union{Missing, String}
5 │ zip Union{Missing, String15}
6 │ city Union{Missing, String}
7 │ region Missing
8 │ cd Union{Missing, String3}
9 │ phone Union{Missing, String31}
10 │ aj_exact_city_match Union{Missing, Bool}
11 │ aj_exact_country_match Union{Missing, Bool}
12 │ blurb String
13 │ dir_city_id Union{Missing, String15}
14 │ dir_city_name Union{Missing, String}
15 │ dir_city_slug Union{Missing, String}
16 │ dir_country_id Union{Missing, String3}
17 │ dir_country_name Union{Missing, String15}
18 │ lat Union{Missing, Float64}
19 │ lng Union{Missing, Float64}
20 │ primary_alias_on_fb String
21 │ slug String
22 │ website String
23 │ cts String31
Since the data frames have many columns, I summarized them keeping only the
element type of each column. I loaded only the first 100 rows of data for this inspection.
Now a similar task in DuckDB:
julia> using DuckDB
julia> con = DBInterface.connect(DuckDB.DB, ":memory:")
DuckDB.DB(":memory:")
julia> DBInterface.execute(con,
"""
DESCRIBE
SELECT *
FROM 'instagram_posts.csv'
LIMIT 100
""")
10×6 DataFrame
Row │ column_name column_type null key default extra
│ String? String? String? String? String? String?
─────┼──────────────────────────────────────────────────────────────────
1 │ sid INTEGER YES missing missing missing
2 │ sid_profile INTEGER YES missing missing missing
3 │ post_id VARCHAR YES missing missing missing
4 │ profile_id BIGINT YES missing missing missing
5 │ location_id BIGINT YES missing missing missing
6 │ cts TIMESTAMP YES missing missing missing
7 │ post_type INTEGER YES missing missing missing
8 │ description VARCHAR YES missing missing missing
9 │ numbr_likes INTEGER YES missing missing missing
10 │ number_comments INTEGER YES missing missing missing
julia> DBInterface.execute(con,
"""
DESCRIBE
SELECT *
FROM 'instagram_profiles.csv'
LIMIT 100
""")
11×6 DataFrame
Row │ column_name column_type null key default extra
│ String? String? String? String? String? String?
─────┼──────────────────────────────────────────────────────────────────────
1 │ sid INTEGER YES missing missing missing
2 │ profile_id BIGINT YES missing missing missing
3 │ profile_name VARCHAR YES missing missing missing
4 │ firstname_lastname VARCHAR YES missing missing missing
5 │ description VARCHAR YES missing missing missing
6 │ following INTEGER YES missing missing missing
7 │ followers INTEGER YES missing missing missing
8 │ n_posts INTEGER YES missing missing missing
9 │ url VARCHAR YES missing missing missing
10 │ cts VARCHAR YES missing missing missing
11 │ is_business_account BOOLEAN YES missing missing missing
julia> DBInterface.execute(con,
"""
DESCRIBE
SELECT *
FROM 'instagram_locations.csv'
LIMIT 100
""")
23×6 DataFrame
Row │ column_name column_type null key default extra
│ String? String? String? String? String? String?
─────┼─────────────────────────────────────────────────────────────────────────
1 │ sid INTEGER YES missing missing missing
2 │ id BIGINT YES missing missing missing
3 │ name VARCHAR YES missing missing missing
4 │ street VARCHAR YES missing missing missing
5 │ zip VARCHAR YES missing missing missing
6 │ city VARCHAR YES missing missing missing
7 │ region VARCHAR YES missing missing missing
8 │ cd VARCHAR YES missing missing missing
9 │ phone VARCHAR YES missing missing missing
10 │ aj_exact_city_match BOOLEAN YES missing missing missing
11 │ aj_exact_country_match BOOLEAN YES missing missing missing
12 │ blurb VARCHAR YES missing missing missing
13 │ dir_city_id VARCHAR YES missing missing missing
14 │ dir_city_name VARCHAR YES missing missing missing
15 │ dir_city_slug VARCHAR YES missing missing missing
16 │ dir_country_id VARCHAR YES missing missing missing
17 │ dir_country_name VARCHAR YES missing missing missing
18 │ lat DOUBLE YES missing missing missing
19 │ lng DOUBLE YES missing missing missing
20 │ primary_alias_on_fb VARCHAR YES missing missing missing
21 │ slug VARCHAR YES missing missing missing
22 │ website VARCHAR YES missing missing missing
23 │ cts TIMESTAMP YES missing missing missing
Note that we already see how nicely DuckDB is integrated with Julia. You get
a DataFrame as a result, so you can keep working with it if you want.
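For example, nothing stops you from post-processing a query result with regular DataFrames.jl functions. A minimal sketch (the desc variable name is my own; DuckDB.jl results are Tables.jl-compatible, so wrapping them in the DataFrame constructor works even if the result is already materialized):
julia> desc = DataFrame(DBInterface.execute(con,
           """
           DESCRIBE
           SELECT *
           FROM 'instagram_posts.csv'
           LIMIT 100
           """));

julia> count(==("VARCHAR"), desc.column_type)  # post_id and description
2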
Performing the analysis
Let us check the performance of the requested operations end-to-end in both
ecosystems.
We start with the pure Julia solution:
julia> @time begin
posts = CSV.read("instagram_posts.csv", DataFrame,
select=[:profile_id, :location_id, :number_comments])
profiles = CSV.read("instagram_profiles.csv", DataFrame,
select=[:profile_id, :n_posts])
locations = CSV.read("instagram_locations.csv", DataFrame,
select=[:id, :dir_country_name])
posts2 = dropmissing(posts)
profiles2 = combine(groupby(dropmissing(profiles), :profile_id),
:n_posts => maximum => :n_posts)
locations2 = dropmissing(locations)
leftjoin!(posts2, profiles2, on="profile_id")
leftjoin!(posts2, locations2, on="location_id" => "id")
gdf = groupby(posts2, "dir_country_name", sort=true, skipmissing=true)
combine(gdf, [:n_posts, :number_comments] .=> mean∘skipmissing)
end
23.323866 seconds (514.97 k allocations: 5.622 GiB, 3.30% gc time)
235×3 DataFrame
Row │ dir_country_name n_posts_mean_skipmissing number_comments_mean_skipmissing
│ String? Float64 Float64
─────┼──────────────────────────────────────────────────────────────────────────────────
1 │ Afghanistan 445.58 8.10767
2 │ Albania 524.789 6.83736
3 │ Algeria 353.933 9.9979
4 │ American Samoa 117.692 7.87879
5 │ Andorra 486.206 7.01296
6 │ Angola 469.621 7.48282
7 │ Anguilla 1206.22 4.77193
8 │ Antarctica 1165.97 20.3073
9 │ Antigua 638.995 8.05999
10 │ Argentina 727.673 11.0821
11 │ Armenia 663.196 4.76728
⋮ │ ⋮ ⋮ ⋮
226 │ Uruguay 560.936 8.00079
227 │ Uzbekistan 445.975 8.84245
228 │ Vanuatu 539.349 4.63688
229 │ Vatican City 583.498 7.05353
230 │ Venezuela 551.79 9.4976
231 │ Vietnam 582.829 4.46671
232 │ Western Sahara 643.888 4.6676
233 │ Yemen 411.5 27.9416
234 │ Zambia 411.779 7.09239
235 │ Zimbabwe 518.17 8.3875
214 rows omitted
Now compare it to DuckDB:
julia> @time DBInterface.execute(con,
"""
SELECT dir_country_name,
mean(n_posts) FILTER (WHERE n_posts IS NOT NULL) n_posts_mean,
mean(number_comments) FILTER (WHERE number_comments IS NOT NULL) number_comments_mean
FROM (SELECT po.number_comments, pr.n_posts, loc.dir_country_name
FROM (SELECT profile_id, location_id, number_comments FROM 'instagram_posts.csv'
WHERE profile_id IS NOT NULL
AND location_id IS NOT NULL
AND number_comments IS NOT NULL
) po
LEFT JOIN (SELECT profile_id, max(n_posts) n_posts
FROM 'instagram_profiles.csv'
WHERE profile_id IS NOT NULL
AND n_posts IS NOT NULL
GROUP BY profile_id) pr
ON po.profile_id = pr.profile_id
LEFT JOIN (SELECT id, dir_country_name FROM 'instagram_locations.csv'
WHERE id IS NOT NULL
AND dir_country_name IS NOT NULL) loc
ON po.location_id = loc.id)
WHERE dir_country_name IS NOT NULL
GROUP BY dir_country_name
ORDER BY dir_country_name
""")
93.220900 seconds (166.60 k allocations: 8.578 MiB, 0.11% compilation time)
235×3 DataFrame
Row │ dir_country_name n_posts_mean number_comments_mean
│ String? Float64? Float64?
─────┼──────────────────────────────────────────────────────────
1 │ Afghanistan 445.58 8.10767
2 │ Albania 524.789 6.83736
3 │ Algeria 353.933 9.9979
4 │ American Samoa 117.692 7.87879
5 │ Andorra 486.206 7.01296
6 │ Angola 469.621 7.48282
7 │ Anguilla 1206.22 4.77193
8 │ Antarctica 1165.97 20.3073
9 │ Antigua 638.995 8.05999
10 │ Argentina 727.673 11.0821
11 │ Armenia 663.196 4.76728
⋮ │ ⋮ ⋮ ⋮
226 │ Uruguay 560.936 8.00079
227 │ Uzbekistan 445.975 8.84245
228 │ Vanuatu 539.349 4.63688
229 │ Vatican City 583.498 7.05353
230 │ Venezuela 551.79 9.4976
231 │ Vietnam 582.829 4.46671
232 │ Western Sahara 643.888 4.6676
233 │ Yemen 411.5 27.9416
234 │ Zambia 411.779 7.09239
235 │ Zimbabwe 518.17 8.3875
214 rows omitted
We see that we get the same result, but the performance is worse. However, let
us split out the part that ingests the data into RAM as a separate step.
Here we will use a very nice feature of DuckDB: you can register a data
frame as a table.
We start with Julia again:
julia> @time begin
posts2 = dropmissing(posts)
profiles2 = combine(groupby(dropmissing(profiles), :profile_id),
:n_posts => maximum => :n_posts)
locations2 = dropmissing(locations)
leftjoin!(posts2, profiles2, on="profile_id")
leftjoin!(posts2, locations2, on="location_id" => "id")
gdf = groupby(posts2, "dir_country_name", sort=true, skipmissing=true)
combine(gdf, [:n_posts, :number_comments] .=> mean∘skipmissing)
end;
5.632483 seconds (1.61 k allocations: 4.065 GiB, 2.94% gc time)
Now check DuckDB:
julia> DuckDB.register_data_frame(con, posts, "posts")
julia> DuckDB.register_data_frame(con, profiles, "profiles")
julia> DuckDB.register_data_frame(con, locations, "locations")
julia> @time DBInterface.execute(con,
"""
SELECT dir_country_name,
mean(n_posts) FILTER (WHERE n_posts IS NOT NULL) n_posts_mean,
mean(number_comments) FILTER (WHERE number_comments IS NOT NULL) number_comments_mean
FROM (SELECT po.number_comments, pr.n_posts, loc.dir_country_name
FROM (SELECT profile_id, location_id, number_comments
FROM posts
WHERE profile_id IS NOT NULL
AND location_id IS NOT NULL
AND number_comments IS NOT NULL
) po
LEFT JOIN (SELECT profile_id, max(n_posts) n_posts
FROM (SELECT profile_id, n_posts
FROM profiles
WHERE n_posts IS NOT NULL)
GROUP BY profile_id) pr
ON po.profile_id = pr.profile_id
LEFT JOIN (SELECT id, dir_country_name
FROM locations
WHERE id IS NOT NULL
AND dir_country_name IS NOT NULL) loc
ON po.location_id = loc.id)
GROUP BY dir_country_name
ORDER BY dir_country_name
""");
1.818248 seconds (616.46 k allocations: 24.675 MiB, 14.41% compilation time)
As you can see, this time DuckDB is faster than DataFrames.jl. We also see that
the majority of the original run time was spent reading the data from disk, and
that the analysis part in both ecosystems is reasonably fast.
I investigated this further, and the major conclusions about the performance
difference are:
- DuckDB performs the query in one shot, while the DataFrames.jl API is eager
  and materializes all intermediate tables; when I split the big SQL query into
  steps that materialize intermediate tables, the performance difference was
  significantly smaller (see the sketch after this list).
- Joins are faster in DuckDB (this is a known issue in DataFrames.jl; we still
  need to add multi-threading support to some of the algorithms we provide).
- Split-apply-combine is faster in DataFrames.jl.
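To illustrate the first point, here is a hedged sketch of forcing DuckDB to materialize one of the intermediate results as a temporary table (the posts_clean name is my own; the remaining steps would then query this table instead of the inline subquery):
julia> DBInterface.execute(con,
       """
       CREATE TEMPORARY TABLE posts_clean AS
       SELECT profile_id, location_id, number_comments
       FROM posts
       WHERE profile_id IS NOT NULL
         AND location_id IS NOT NULL
         AND number_comments IS NOT NULL
       """);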
Conclusions
For me the key takeaways from this test were:
- DuckDB integrates excellently with Julia and DataFrames.jl. As you saw,
  you can register data frames as tables in DuckDB and get query results back
  as data frames.
- DuckDB is fast. You can appreciate the difference especially when you
  perform large joins (this will improve in DataFrames.jl in the future)
  and when you execute complex queries (this is not likely to improve in
  DataFrames.jl in the near future, as the current design of the package is
  eager and does not support composed query optimization).
- I found the DataFrames.jl code easier to read, but I am probably biased here.
  Data scientists proficient in SQL will probably find the DuckDB code more
  natural.
Happy hacking with Julia and DuckDB!