Welcome to DuckDB

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/12/23/duckdb.html

Introduction

I usually write in this blog about DataFrames.jl. This time I decided
to switch my mode to good old SQL queries.

The inspiration from the post is a follow up after the talk
“Data manipulation in Julia for pandas users” I gave last week during
Data Science Summit 2022. In that talk I explained how
DataFrames.jl compares to pandas.

After the talk, I thought to check how the comparison would look against
DuckDB, which is an excellent in-process SQL OLAP database management
system.

In this post I show how a sample analytical scenario can be executed in
DataFrames.jl and DuckDB, and how excellently they are integrated.

The code was run under Julia 1.8.2, CSV.jl 0.10.8, DataFrames.jl 1.4.4, and
DuckDB.jl 0.6.0 on a laptop with 32GB of RAM. I perform tests using 8 cores.

The task

When I was preparing the “Data manipulation in Julia for pandas users” talk
I wanted to choose some data set that is not too big (so that things are
quick) nor not too small (so that they are not too quick).

So I went to Kaggle and searched for datasets with: 5 to 10 GB of data, data in
CSV files, a few (but not many) files. Ordered such datasets by hotness and the
one on top of the list was: Instagram dataset, consisting of three tables:

  • instagram_posts.csv: 42M rows of individual post information;
  • instagram_locations.csv: 1.2M rows of location information;
  • instagram_profiles.csv: 4.5M rows of user profile information.

You can find the dataset here.

For each post we can match a profile of a person who made it and a location
where it was made. Therefore I decided to do the following operations as a test:

  • read the data from disk to RAM;
  • drop all rows with missing data in posts;
  • drop all rows with missing data in profiles, and in case of duplicate profiles
    keep only one of them;
  • drop all rows with missing data in locations;
  • left join posts and profiles;
  • left join posts and locations;
  • compute some aggregate statistics of posts data by country.

Below I show how these tasks can be performed in DataFrames.jl and
DuckDB.

(as a side note – doing this “as is” in pandas was not possible on my
laptop as I was running out of RAM and some of the steps in this process took
too long; therefore in my talk during the conference I presented a bit different
scenario)

Initial inspection of the data

We start with Julia to peek into data. First do it in Julia:

julia> using DataFrames

julia> using CSV

julia> using Statistics

julia> describe(CSV.read("instagram_posts.csv", DataFrame, limit=100), :eltype)
10×2 DataFrame
 Row │ variable         eltype
     │ Symbol           Type
─────┼─────────────────────────────────────────
   1 │ sid              Int64
   2 │ sid_profile      Int64
   3 │ post_id          String15
   4 │ profile_id       Int64
   5 │ location_id      Int64
   6 │ cts              String31
   7 │ post_type        Int64
   8 │ description      Union{Missing, String}
   9 │ numbr_likes      Int64
  10 │ number_comments  Int64

julia> describe(CSV.read("instagram_profiles.csv", DataFrame, limit=100), :eltype)
11×2 DataFrame
 Row │ variable             eltype
     │ Symbol               Type
─────┼─────────────────────────────────────────────
   1 │ sid                  Int64
   2 │ profile_id           Int64
   3 │ profile_name         String31
   4 │ firstname_lastname   String31
   5 │ description          Union{Missing, String}
   6 │ following            Union{Missing, Int64}
   7 │ followers            Union{Missing, Int64}
   8 │ n_posts              Union{Missing, Int64}
   9 │ url                  Union{Missing, String}
  10 │ cts                  String31
  11 │ is_business_account  Union{Missing, Bool}

julia> describe(CSV.read("instagram_locations.csv", DataFrame, limit=100), :eltype)
23×2 DataFrame
 Row │ variable                eltype
     │ Symbol                  Type
─────┼──────────────────────────────────────────────────
   1 │ sid                     Int64
   2 │ id                      Int64
   3 │ name                    String
   4 │ street                  Union{Missing, String}
   5 │ zip                     Union{Missing, String15}
   6 │ city                    Union{Missing, String}
   7 │ region                  Missing
   8 │ cd                      Union{Missing, String3}
   9 │ phone                   Union{Missing, String31}
  10 │ aj_exact_city_match     Union{Missing, Bool}
  11 │ aj_exact_country_match  Union{Missing, Bool}
  12 │ blurb                   String
  13 │ dir_city_id             Union{Missing, String15}
  14 │ dir_city_name           Union{Missing, String}
  15 │ dir_city_slug           Union{Missing, String}
  16 │ dir_country_id          Union{Missing, String3}
  17 │ dir_country_name        Union{Missing, String15}
  18 │ lat                     Union{Missing, Float64}
  19 │ lng                     Union{Missing, Float64}
  20 │ primary_alias_on_fb     String
  21 │ slug                    String
  22 │ website                 String
  23 │ cts                     String31

Since the data frames had many columns I summarized them keeping only element
type. I loaded first 100 rows of data for inspection.

Now a similar task in DuckDB:

julia> using DuckDB

julia> con = DBInterface.connect(DuckDB.DB, ":memory:")
DuckDB.DB(":memory:")

julia> DBInterface.execute(con,
           """
           DESCRIBE
           SELECT *
           FROM 'instagram_posts.csv'
           LIMIT 100
           """)
10×6 DataFrame
 Row │ column_name      column_type  null     key      default  extra
     │ String?          String?      String?  String?  String?  String?
─────┼──────────────────────────────────────────────────────────────────
   1 │ sid              INTEGER      YES      missing  missing  missing
   2 │ sid_profile      INTEGER      YES      missing  missing  missing
   3 │ post_id          VARCHAR      YES      missing  missing  missing
   4 │ profile_id       BIGINT       YES      missing  missing  missing
   5 │ location_id      BIGINT       YES      missing  missing  missing
   6 │ cts              TIMESTAMP    YES      missing  missing  missing
   7 │ post_type        INTEGER      YES      missing  missing  missing
   8 │ description      VARCHAR      YES      missing  missing  missing
   9 │ numbr_likes      INTEGER      YES      missing  missing  missing
  10 │ number_comments  INTEGER      YES      missing  missing  missing

julia> DBInterface.execute(con,
           """
           DESCRIBE
           SELECT *
           FROM 'instagram_profiles.csv'
               LIMIT 100
           """)
11×6 DataFrame
 Row │ column_name          column_type  null     key      default  extra
     │ String?              String?      String?  String?  String?  String?
─────┼──────────────────────────────────────────────────────────────────────
   1 │ sid                  INTEGER      YES      missing  missing  missing
   2 │ profile_id           BIGINT       YES      missing  missing  missing
   3 │ profile_name         VARCHAR      YES      missing  missing  missing
   4 │ firstname_lastname   VARCHAR      YES      missing  missing  missing
   5 │ description          VARCHAR      YES      missing  missing  missing
   6 │ following            INTEGER      YES      missing  missing  missing
   7 │ followers            INTEGER      YES      missing  missing  missing
   8 │ n_posts              INTEGER      YES      missing  missing  missing
   9 │ url                  VARCHAR      YES      missing  missing  missing
  10 │ cts                  VARCHAR      YES      missing  missing  missing
  11 │ is_business_account  BOOLEAN      YES      missing  missing  missing

julia> DBInterface.execute(con,
           """
           DESCRIBE
           SELECT *
           FROM 'instagram_locations.csv'
           LIMIT 100
           """)
23×6 DataFrame
 Row │ column_name             column_type  null     key      default  extra
     │ String?                 String?      String?  String?  String?  String?
─────┼─────────────────────────────────────────────────────────────────────────
   1 │ sid                     INTEGER      YES      missing  missing  missing
   2 │ id                      BIGINT       YES      missing  missing  missing
   3 │ name                    VARCHAR      YES      missing  missing  missing
   4 │ street                  VARCHAR      YES      missing  missing  missing
   5 │ zip                     VARCHAR      YES      missing  missing  missing
   6 │ city                    VARCHAR      YES      missing  missing  missing
   7 │ region                  VARCHAR      YES      missing  missing  missing
   8 │ cd                      VARCHAR      YES      missing  missing  missing
   9 │ phone                   VARCHAR      YES      missing  missing  missing
  10 │ aj_exact_city_match     BOOLEAN      YES      missing  missing  missing
  11 │ aj_exact_country_match  BOOLEAN      YES      missing  missing  missing
  12 │ blurb                   VARCHAR      YES      missing  missing  missing
  13 │ dir_city_id             VARCHAR      YES      missing  missing  missing
  14 │ dir_city_name           VARCHAR      YES      missing  missing  missing
  15 │ dir_city_slug           VARCHAR      YES      missing  missing  missing
  16 │ dir_country_id          VARCHAR      YES      missing  missing  missing
  17 │ dir_country_name        VARCHAR      YES      missing  missing  missing
  18 │ lat                     DOUBLE       YES      missing  missing  missing
  19 │ lng                     DOUBLE       YES      missing  missing  missing
  20 │ primary_alias_on_fb     VARCHAR      YES      missing  missing  missing
  21 │ slug                    VARCHAR      YES      missing  missing  missing
  22 │ website                 VARCHAR      YES      missing  missing  missing
  23 │ cts                     TIMESTAMP    YES      missing  missing  missing

Note that we already see how nicely DuckDB is integrated with Julia. You get
a DataFrame as a result, so you can work with it, if you wanted.

Performing the analysis

Let us check the performance of the requested operation end-to-end in both
ecosystems.

Start with pure Julia solution:

julia> @time begin
           posts = CSV.read("instagram_posts.csv", DataFrame,
                            select=[:profile_id, :location_id, :number_comments])
           profiles = CSV.read("instagram_profiles.csv", DataFrame,
                               select=[:profile_id, :n_posts])
           locations = CSV.read("instagram_locations.csv", DataFrame,
                                select=[:id, :dir_country_name])
           posts2 = dropmissing(posts)
           profiles2 = combine(groupby(dropmissing(profiles), :profile_id),
                               :n_posts => maximum => :n_posts)
           locations2 = dropmissing(locations)
           leftjoin!(posts2, profiles2, on="profile_id")
           leftjoin!(posts2, locations2, on="location_id" => "id")
           gdf = groupby(posts2, "dir_country_name", sort=true, skipmissing=true)
           combine(gdf, [:n_posts, :number_comments] .=> mean∘skipmissing)
       end
 23.323866 seconds (514.97 k allocations: 5.622 GiB, 3.30% gc time)
235×3 DataFrame
 Row │ dir_country_name      n_posts_mean_skipmissing  number_comments_mean_skipmissing
     │ String?               Float64                   Float64
─────┼──────────────────────────────────────────────────────────────────────────────────
   1 │ Afghanistan                            445.58                            8.10767
   2 │ Albania                                524.789                           6.83736
   3 │ Algeria                                353.933                           9.9979
   4 │ American Samoa                         117.692                           7.87879
   5 │ Andorra                                486.206                           7.01296
   6 │ Angola                                 469.621                           7.48282
   7 │ Anguilla                              1206.22                            4.77193
   8 │ Antarctica                            1165.97                           20.3073
   9 │ Antigua                                638.995                           8.05999
  10 │ Argentina                              727.673                          11.0821
  11 │ Armenia                                663.196                           4.76728
  ⋮  │          ⋮                       ⋮                             ⋮
 226 │ Uruguay                                560.936                           8.00079
 227 │ Uzbekistan                             445.975                           8.84245
 228 │ Vanuatu                                539.349                           4.63688
 229 │ Vatican City                           583.498                           7.05353
 230 │ Venezuela                              551.79                            9.4976
 231 │ Vietnam                                582.829                           4.46671
 232 │ Western Sahara                         643.888                           4.6676
 233 │ Yemen                                  411.5                            27.9416
 234 │ Zambia                                 411.779                           7.09239
 235 │ Zimbabwe                               518.17                            8.3875
                                                                        214 rows omitted

Now compare it to DuckDB:

julia> @time DBInterface.execute(con,
           """
           SELECT dir_country_name,
               mean(n_posts) FILTER (WHERE n_posts IS NOT NULL) n_posts_mean,
               mean(number_comments) FILTER (WHERE number_comments IS NOT NULL) number_comments_mean
           FROM (SELECT po.number_comments, pr.n_posts, loc.dir_country_name
                 FROM (SELECT profile_id, location_id, number_comments FROM 'instagram_posts.csv'
                       WHERE profile_id IS NOT NULL
                             AND location_id IS NOT NULL
                             AND number_comments IS NOT NULL
                       ) po
                 LEFT JOIN (SELECT profile_id, max(n_posts) n_posts
                            FROM 'instagram_profiles.csv'
                            WHERE profile_id IS NOT NULL
                                  AND n_posts IS NOT NULL
                            GROUP BY profile_id) pr
                 ON po.profile_id = pr.profile_id
                 LEFT JOIN (SELECT id, dir_country_name FROM 'instagram_locations.csv'
                            WHERE id IS NOT NULL
                                  AND dir_country_name IS NOT NULL) loc
                 ON po.location_id = loc.id)
           WHERE dir_country_name IS NOT NULL
           GROUP BY dir_country_name
           ORDER BY dir_country_name
           """)
 93.220900 seconds (166.60 k allocations: 8.578 MiB, 0.11% compilation time)
235×3 DataFrame
 Row │ dir_country_name      n_posts_mean  number_comments_mean
     │ String?               Float64?      Float64?
─────┼──────────────────────────────────────────────────────────
   1 │ Afghanistan                445.58                8.10767
   2 │ Albania                    524.789               6.83736
   3 │ Algeria                    353.933               9.9979
   4 │ American Samoa             117.692               7.87879
   5 │ Andorra                    486.206               7.01296
   6 │ Angola                     469.621               7.48282
   7 │ Anguilla                  1206.22                4.77193
   8 │ Antarctica                1165.97               20.3073
   9 │ Antigua                    638.995               8.05999
  10 │ Argentina                  727.673              11.0821
  11 │ Armenia                    663.196               4.76728
  ⋮  │          ⋮                 ⋮                 ⋮
 226 │ Uruguay                    560.936               8.00079
 227 │ Uzbekistan                 445.975               8.84245
 228 │ Vanuatu                    539.349               4.63688
 229 │ Vatican City               583.498               7.05353
 230 │ Venezuela                  551.79                9.4976
 231 │ Vietnam                    582.829               4.46671
 232 │ Western Sahara             643.888               4.6676
 233 │ Yemen                      411.5                27.9416
 234 │ Zambia                     411.779               7.09239
 235 │ Zimbabwe                   518.17                8.3875
                                                214 rows omitted

We see that we get the same result. The performance is worse. However, let
us split out the data ingest into RAM part to a separate process.

Here we will use a very nice feature of DuckDB that you can register a data
frame to be a table.

We start with Julia again:

julia> @time begin
           posts2 = dropmissing(posts)
           profiles2 = combine(groupby(dropmissing(profiles), :profile_id),
                               :n_posts => maximum => :n_posts)
           locations2 = dropmissing(locations)
           leftjoin!(posts2, profiles2, on="profile_id")
           leftjoin!(posts2, locations2, on="location_id" => "id")
           gdf = groupby(posts2, "dir_country_name", sort=true, skipmissing=true)
           combine(gdf, [:n_posts, :number_comments] .=> mean∘skipmissing)
       end;
  5.632483 seconds (1.61 k allocations: 4.065 GiB, 2.94% gc time)

Now check DuckDB:

julia> DuckDB.register_data_frame(con, posts, "posts")

julia> DuckDB.register_data_frame(con, profiles, "profiles")

julia> DuckDB.register_data_frame(con, locations, "locations")

julia> @time DBInterface.execute(con,
           """
           SELECT dir_country_name,
               mean(n_posts) FILTER (WHERE n_posts IS NOT NULL) n_posts_mean,
               mean(number_comments) FILTER (WHERE number_comments IS NOT NULL) number_comments_mean
           FROM (SELECT po.number_comments, pr.n_posts, loc.dir_country_name
                 FROM (SELECT profile_id, location_id, number_comments 
                       FROM posts
                       WHERE profile_id IS NOT NULL
                             AND location_id IS NOT NULL
                             AND number_comments IS NOT NULL
                       ) po
                 LEFT JOIN (SELECT profile_id, max(n_posts) n_posts
                            FROM (SELECT profile_id, n_posts
                                  FROM profiles
                                  WHERE n_posts IS NOT NULL)
                            GROUP BY profile_id) pr
                 ON po.profile_id = pr.profile_id
                 LEFT JOIN (SELECT id, dir_country_name
                            FROM locations
                            WHERE id IS NOT NULL
                                  AND dir_country_name IS NOT NULL) loc
                 ON po.location_id = loc.id)
           GROUP BY dir_country_name
           ORDER BY dir_country_name
           """);
  1.818248 seconds (616.46 k allocations: 24.675 MiB, 14.41% compilation time)

As you can see this time DuckDB is faster than Julia. Also we see that the
majority of original time was spent in reading the data from disk and that
the analysis part in both ecosystems is reasonably fast.

I have investigated into this further and the major conclusions for the
performance difference are:

  • DuckDB performs the query in one shot, while in DataFrames.jl the API is
    eager, materializing all intermediate tables; when I split the big SQL query
    into steps materializing intermediate tables the performance difference was
    significantly smaller.
  • joins are faster in DuckDB (this is a known issue in DataFrames.jl that we
    still need to add multi-threading support to some of the algorithms we
    provide);
  • split-apply-combine is faster in DataFrames.jl.

Conclusions

For me the key takeaways from this test were:

  • DuckDB excellently integrates with Julia and DataFrames.jl. As you saw
    you can register data frames as tables in DuckDB and get query results as
    data frames.
  • DuckDB is fast. You can appreciate these differences especially if
    you perform large joins (this will improve in DataFrames.jl in the future)
    and when you execute complex queries (this will not likely improve in
    DataFrames.jl in the near future – the current design of the package is eager
    and does not support composed query optimization).
  • I found DataFrmes.jl code easier to read, but probably I am biased here.
    Data scientists proficient with SQL probably will find DuckDB code more
    natural.

Happy hacking with Julia and DuckDB!