Benchmarking split-apply-combine: DataFrames.jl vs Pandas

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/05/27/strings.html

Introduction

Recently chapters 5 and 6 of my Julia for Data Analysis book were
published in MEAP. Finally we are getting to some more fun stuff like parametric
types or discussion of options for working with strings in Julia.

In the comments from the readers who decided to opt-in for early access to
the book I was asked how does performance of DataFrames.jl compare to Pandas.
Therefore today I decided to run a small benchmark.

The codes were run under:

  • Julia 1.7.2, DataFrames.jl 1.3.4, InlineStrings.jl 1.1.2,
    PooledArrays.jl 1.4.2, CategoricalArrays.jl 0.10.6,
    and BenchmarkTools.jl 1.3.1;
  • Python 3.9.12, Pandas 1.4.2.

The test scenario

I want to:

  1. generate 1,000,000 random strings of length 8 consisting of letters;
  2. put them in a data frame;
  3. group the data frame by the string column and calculate number of rows
    in every group.

In the tests I want to compare the performance of different options how strings
can be stored.

Julia code

In Julia tests I check the following storage modes of strings: Vector{String},
Vector{String15} (inline string), Vector{Symbol}, PooledVector{String},
and CategoricalVector{String}. Here is the benchmark:

julia> using Random

julia> using InlineStrings

julia> using BenchmarkTools

julia> using PooledArrays

julia> using CategoricalArrays

julia> Random.seed!(1234);

julia> df = transform!(DataFrame(str=[randstring() for _ in 1:10^6]),
                       :str .=>
                       [inlinestrings, ByRow(Symbol),
                        PooledArray, CategoricalArray] .=>
                       [:istr, :sym, :pstr, :cstr])
1000000×5 DataFrame
     Row │ str       istr      sym       pstr      cstr
         │ String    String15  Symbol    String    Cat…
─────────┼──────────────────────────────────────────────────
       1 │ KYDtLOxn  KYDtLOxn  KYDtLOxn  KYDtLOxn  KYDtLOxn
       2 │ UkZj0CRg  UkZj0CRg  UkZj0CRg  UkZj0CRg  UkZj0CRg
    ⋮    │    ⋮         ⋮         ⋮         ⋮         ⋮
  999999 │ CZL6fcG5  CZL6fcG5  CZL6fcG5  CZL6fcG5  CZL6fcG5
 1000000 │ 1MqvPpVb  1MqvPpVb  1MqvPpVb  1MqvPpVb  1MqvPpVb
                                         999996 rows omitted

julia> @belapsed combine(groupby($df, :str), nrow)
0.0486384

julia> @belapsed combine(groupby($df, :istr), nrow)
0.0461695

julia> @belapsed combine(groupby($df, :sym), nrow)
0.0422065

julia> @belapsed combine(groupby($df, :pstr), nrow)
0.0096023

julia> @belapsed combine(groupby($df, :cstr), nrow)
0.0304362

As you can see PooledVector is by far fastest. Next comes CategoricalVector,
but it is significantly slower. The performance of String, String15,
and Symbol is comparable; String is slowest and Symbol is fastest.

Python code

In Python I compared Series containing str and Categorical. Here is the
code:

import pandas as pd

import random

import string

random.seed(1234)

s = [''.join(random.choice(string.ascii_letters) for _ in range(8)) for _ in range(10**6)]

cs = pd.Categorical(s)

df = pd.DataFrame({'str': s, 'cstr': cs})

%time df.groupby(['str']).size()
CPU times: total: 1.56 s
Wall time: 1.56 s
Out[8]:
str
AAAGINxY    1
AAAIavpP    1
AAASsiaU    1
AAAfnfqH    1
AAAxQZTv    1
           ..
zzzNaWBc    1
zzzTCZdT    1
zzzmmTcP    1
zzzoySQG    1
zzzwzAsA    1
Length: 1000000, dtype: int64

%time df.groupby(['cstr']).size()
CPU times: total: 109 ms
Wall time: 105 ms
Out[9]:
cstr
AAAGINxY    1
AAAIavpP    1
AAASsiaU    1
AAAfnfqH    1
AAAxQZTv    1
           ..
zzzNaWBc    1
zzzTCZdT    1
zzzmmTcP    1
zzzoySQG    1
zzzwzAsA    1
Length: 1000000, dtype: int64

As you can see using Categorical is much faster. However, both options
are visibly slower than any of the Julia variants we have considered.

Conclusions

First, I would like to thank all reviewers and readers of my book in MEAP. They
really help to improve it and I appreciate the feedback I receive a lot.

Now let me turn to a comment on benchmark results. First, I want to remark that
I usually avoid running such comparisons as it is really hard to do them
comprehensively and objectively. Therefore, I recommend you to treat my
conclusions as a general guidance inferred from the tests I have performed:

  • DataFrames.jl was consistently faster than Pandas.
  • In both ecosystems it is better to “pool” data before running
    split-apply-combine operation; this is especially relevant if you expect
    to run such operation many times.
  • In Julia PooledArrays.jl is visibly faster than CategoricalArrays.jl.
    Therefore if you just need compression (as offered by PooledArrays.jl)
    I do not recommend you to use CategoricalArrays.jl (use it if you need a
    first-class container for categorical data). Note that we see such a big
    difference because we had very many levels in our data. If we had only
    a few levels (a typical use case for CategoricalArrays.jl) its performance
    would be much closer to PooledArrays.jl.
  • In Julia if you use non-pooled vector the choice of string storage type
    does not affect the considered timings a lot. If you would like to learn
    when this decision is important I recommend you to read this post.

I hope you will find the presented benchmarks useful in your daily data
wrangling tasks.