Categorical vs pooled arrays

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/01/21/categorical.html

Introduction

When working with data in Julia you most likely have encountered two packages
PooledArrays.jl and CategoricalArrays.jl. They both provide
custom implementations of AbstractArray that are useful when you have low
cardinality data.

In this post I want to highlight major similarities and differences between
these two packages.

The examples were tested under Julia 1.7.0, DataFrames.jl 1.3.1,
PooledArrays.jl 1.4.0, CategoricalArrays.jl 0.10.2, FreqTables.jl 0.4.5,
and BenchmarkTols 1.2.2.

PooledArrays.jl

As I have explained in my recent post you can use PooledArrays.jl when
you have large vectors that have few unique values. The simplest way of thinking
about PooledArray type is that it can be used as a way to compress the data.

Here is an example:

julia> using PooledArrays

julia> v = repeat(["a", "b", "c", "d"], 10^6)
4000000-element Vector{String}:
 "a"
 "b"
 "c"
 "d"
 ⋮
 "a"
 "b"
 "c"
 "d"

julia> Base.summarysize(v)
32000076

julia> pv = PooledArray(v)
4000000-element PooledVector{String, UInt32, Vector{UInt32}}:
 "a"
 "b"
 "c"
 "d"
 ⋮
 "a"
 "b"
 "c"
 "d"

julia> Base.summarysize(pv)
16000580

As you can see even for short strings we have achieved significant compression.
You can have even better compression by passing compress=true keyword argument:

julia> pv2 = PooledArray(v, compress=true)
4000000-element PooledVector{String, UInt8, Vector{UInt8}}:
 "a"
 "b"
 "c"
 "d"
 ⋮
 "b"
 "c"
 "d"

julia> Base.summarysize(PooledArray(v, compress=true))
4000532

There is one more benefit of using PooledArray when working with DataFrames.jl.
You can expect grouping and join operations to be performed faster. Here are
some examples:

julia> using DataFrames

julia> using BenchmarkTools

julia> df = DataFrame(;v, pv, pv2)
4000000×3 DataFrame
     Row │ v       pv      pv2
         │ String  String  String
─────────┼────────────────────────
       1 │ a       a       a
       2 │ b       b       b
       3 │ c       c       c
    ⋮    │   ⋮       ⋮       ⋮
 3999998 │ b       b       b
 3999999 │ c       c       c
 4000000 │ d       d       d
              3999994 rows omitted

julia> @benchmark groupby($df, :v)
BenchmarkTools.Trial: 61 samples with 1 evaluation.
 Range (min … max):  66.280 ms … 170.877 ms  ┊ GC (min … max): 0.68% … 58.51%
 Time  (median):     79.254 ms               ┊ GC (median):    0.62%
 Time  (mean ± σ):   82.754 ms ±  17.021 ms  ┊ GC (mean ± σ):  5.53% ± 10.36%

     █          ▇
  ▃▃▃█▅▁▄█▅▄▁▁▃▄█▆▃▁▁▃▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
  66.3 ms         Histogram: frequency by time          154 ms <

 Memory estimate: 125.04 MiB, allocs estimate: 37.

julia> @benchmark groupby($df, :pv)
BenchmarkTools.Trial: 678 samples with 1 evaluation.
 Range (min … max):  3.920 ms … 20.284 ms  ┊ GC (min … max): 0.00% …  4.13%
 Time  (median):     4.835 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.373 ms ±  4.111 ms  ┊ GC (mean ± σ):  5.02% ± 10.96%

  ▃█▇▃▂    ▁▅                       ▁ ▄▁▃▂
  ██████▇▆▇██▇▆▆▆▅▆▇▆█▆▄▅▁▁▆▁▁▆▄▇▅▅▆█▄█████▇▄▅▄▄▄▆▅▅▅▅▁▄▁▅▄▅ ▇
  3.92 ms      Histogram: log(frequency) by time     18.6 ms <

 Memory estimate: 30.52 MiB, allocs estimate: 61.

julia> @benchmark groupby($df, :pv2)
BenchmarkTools.Trial: 710 samples with 1 evaluation.
 Range (min … max):  3.462 ms … 19.995 ms  ┊ GC (min … max): 0.00% …  8.23%
 Time  (median):     4.356 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.035 ms ±  4.189 ms  ┊ GC (mean ± σ):  5.63% ± 11.95%

  ▂▄█▃▁   ▁  ▄▁                       ▃  ▃
  ██████▇▆█▇▇██▇▇▇▆▅▆▇▆▄▅▁▁▁▄▄▆▄▄▇▄▁▄▇██████▇▅▇▅▄▅▄▅▄▁▄▄▅▅▅▇ ▇
  3.46 ms      Histogram: log(frequency) by time       18 ms <

 Memory estimate: 30.52 MiB, allocs estimate: 61.

julia> df2 = df[1:4, :]
4×3 DataFrame
 Row │ v       pv      pv2
     │ String  String  String
─────┼────────────────────────
   1 │ a       a       a
   2 │ b       b       b
   3 │ c       c       c
   4 │ d       d       d

julia> @benchmark innerjoin($df, $df2, on=:v, makeunique=true)
BenchmarkTools.Trial: 26 samples with 1 evaluation.
 Range (min … max):  153.954 ms … 281.650 ms  ┊ GC (min … max): 1.63% … 34.70%
 Time  (median):     189.137 ms               ┊ GC (median):    1.12%
 Time  (mean ± σ):   193.827 ms ±  26.318 ms  ┊ GC (mean ± σ):  5.15% ±  8.92%

           ▁   ▁ ▁█  ▄ ▁
  ▆▁▁▆▁▁▁▁▁█▆▆▁█▁██▆▁█▁█▆▆▆▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▁▁▁▁▁▁▁▁▆ ▁
  154 ms           Histogram: frequency by time          282 ms <

 Memory estimate: 143.01 MiB, allocs estimate: 262.

julia> @benchmark innerjoin($df, $df2, on=:pv, makeunique=true)
BenchmarkTools.Trial: 32 samples with 1 evaluation.
 Range (min … max):  124.786 ms … 258.973 ms  ┊ GC (min … max): 2.46% … 42.85%
 Time  (median):     152.645 ms               ┊ GC (median):    2.56%
 Time  (mean ± σ):   157.173 ms ±  28.157 ms  ┊ GC (mean ± σ):  7.05% ±  9.68%

  ▄        ██▁ █▁ ▄▄
  █▁▆▁▁▆▁▁▁███▆██▁██▁▁▆▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▁▁▁▁▆ ▁
  125 ms           Histogram: frequency by time          259 ms <

 Memory estimate: 158.27 MiB, allocs estimate: 274.

julia> @benchmark innerjoin($df, $df2, on=:pv2, makeunique=true)
BenchmarkTools.Trial: 34 samples with 1 evaluation.
 Range (min … max):  119.717 ms … 252.698 ms  ┊ GC (min … max): 1.92% … 44.18%
 Time  (median):     143.639 ms               ┊ GC (median):    1.02%
 Time  (mean ± σ):   147.490 ms ±  26.220 ms  ┊ GC (mean ± σ):  6.05% ± 10.18%

  ▁     █   ▃▆▁  ▁
  █▁▁▁▁▄█▇▄▁███▁▁█▇▁▄▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▄ ▁
  120 ms           Histogram: frequency by time          253 ms <

 Memory estimate: 169.72 MiB, allocs estimate: 274.

Apart from these differences working with pooled vectors should have no
user visible differences in comparison to working with normal vectors.

CategoricalArrays.jl

The CategoricalArray type is also performing data compression, but its main
use case is when you want to represent your data as categorical in a statistical
sesnse (both ordered and unordered).

However, let me start with showing you the compression effect and grouping
and joining speedups, as they are the same as for PooledArrays.jl.

julia> using CategoricalArrays

julia> cv = categorical(v)
4000000-element CategoricalArray{String,1,UInt32}:
 "a"
 "b"
 "c"
 "d"
 "a"
 ⋮
 "a"
 "b"
 "c"
 "d"

julia> Base.summarysize(cv)
16000612

julia> cv2 = categorical(v, compress=true)
4000000-element CategoricalArray{String,1,UInt8}:
 "a"
 "b"
 "c"
 "d"
 "a"
 ⋮
 "a"
 "b"
 "c"
 "d"

julia> Base.summarysize(cv2)
4000564

julia> df = DataFrame(;v, cv, cv2)
4000000×3 DataFrame
     Row │ v       cv    cv2
         │ String  Cat…  Cat…
─────────┼────────────────────
       1 │ a       a       a
       2 │ b       b       b
       3 │ c       c       c
    ⋮    │   ⋮       ⋮       ⋮
 3999998 │ b       b       b
 3999999 │ c       c       c
 4000000 │ d       d       d
              3999994 rows omitted

julia> @benchmark groupby($df, :v)
BenchmarkTools.Trial: 54 samples with 1 evaluation.
 Range (min … max):  72.757 ms … 109.888 ms  ┊ GC (min … max): 0.00% … 4.32%
 Time  (median):     92.208 ms               ┊ GC (median):    3.27%
 Time  (mean ± σ):   92.917 ms ±  11.122 ms  ┊ GC (mean ± σ):  3.53% ± 2.48%

                        ▆           ▂                      ▆ █
  ▄▄▁▁▁▁▁▄▁▄▄▆▆▁▁▄▄▁▄▁▄▆█▆▁▁▁▁▁▆▁▄▆██▄▁▁▁▄▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▆█▁█ ▁
  72.8 ms         Histogram: frequency by time          108 ms <

 Memory estimate: 125.04 MiB, allocs estimate: 33.

julia> @benchmark groupby($df, :cv)
BenchmarkTools.Trial: 714 samples with 1 evaluation.
 Range (min … max):  3.930 ms … 15.026 ms  ┊ GC (min … max): 0.00% …  4.58%
 Time  (median):     4.474 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.007 ms ±  3.824 ms  ┊ GC (mean ± σ):  5.99% ± 12.26%

  ▃ █▇              ▄▁                               ▁ ▅ ▂ ▄
  █▇██▆▇▄▁▄▁▁▁▁▄▆▅▄███▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▅▁▆▆▁▁▅▅▁█▁█▇█▇█ ▇
  3.93 ms      Histogram: log(frequency) by time     14.1 ms <

 Memory estimate: 30.52 MiB, allocs estimate: 57.

julia> @benchmark groupby($df, :cv2)
BenchmarkTools.Trial: 634 samples with 1 evaluation.
 Range (min … max):  3.506 ms … 22.074 ms  ┊ GC (min … max): 0.00% …  0.00%
 Time  (median):     5.651 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.881 ms ±  4.817 ms  ┊ GC (mean ± σ):  6.49% ± 13.01%

  ▁█▆▂▂ ▃▃▂ ▄▂      ▁           ▄ ▄▁
  ████████████▆▇▅▆█▆█▇▅▁▁▅▅▆▄▇▅▆█████▇▄▅▅▆▅▄▆▅▅▁▇▄▅▆▆▇▅▇▆▆▄▄ ▇
  3.51 ms      Histogram: log(frequency) by time     21.2 ms <

 Memory estimate: 30.52 MiB, allocs estimate: 57.

julia> df2 = df[1:4, :]
4×3 DataFrame
 Row │ v       cv    cv2
     │ String  Cat…  Cat…
─────┼────────────────────
   1 │ a       a     a
   2 │ b       b     b
   3 │ c       c     c
   4 │ d       d     d

julia> @benchmark innerjoin($df, $df2, on=:v, makeunique=true)
BenchmarkTools.Trial: 27 samples with 1 evaluation.
 Range (min … max):  158.062 ms … 193.518 ms  ┊ GC (min … max): 0.46% … 1.62%
 Time  (median):     187.180 ms               ┊ GC (median):    1.66%
 Time  (mean ± σ):   186.249 ms ±   7.073 ms  ┊ GC (mean ± σ):  1.89% ± 0.88%

                                                   █▂
  ▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁███▄▄▆▄▄▄▄▁▁▆ ▁
  158 ms           Histogram: frequency by time          194 ms <

 Memory estimate: 143.02 MiB, allocs estimate: 275.

julia> @benchmark innerjoin($df, $df2, on=:cv, makeunique=true)
BenchmarkTools.Trial: 33 samples with 1 evaluation.
 Range (min … max):  133.837 ms … 253.376 ms  ┊ GC (min … max): 0.54% … 43.65%
 Time  (median):     152.783 ms               ┊ GC (median):    1.83%
 Time  (mean ± σ):   154.100 ms ±  21.196 ms  ┊ GC (mean ± σ):  5.68% ±  8.33%

    █▂     ▅ ▅▂
  ▅▁██▅██▅▅████▁▅▁▁▁▁▁▁▁▁▁▅▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  134 ms           Histogram: frequency by time          253 ms <

 Memory estimate: 158.27 MiB, allocs estimate: 284.

julia> @benchmark innerjoin($df, $df2, on=:cv2, makeunique=true)
BenchmarkTools.Trial: 31 samples with 1 evaluation.
 Range (min … max):  118.869 ms … 292.135 ms  ┊ GC (min … max): 0.63% … 39.85%
 Time  (median):     149.992 ms               ┊ GC (median):    2.45%
 Time  (mean ± σ):   161.881 ms ±  35.070 ms  ┊ GC (mean ± σ):  8.69% ± 10.37%

            █
  ▃▁▁▁▁▃▃▁▁▃█▆▃▃▃▃▁▃▄▃▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▃ ▁
  119 ms           Histogram: frequency by time          292 ms <

 Memory estimate: 169.72 MiB, allocs estimate: 284.

So what is the difference?

First, only a limited set of value types can be stored in categorical array:

julia> categorical([nothing])
ERROR: ArgumentError: CategoricalArray only supports AbstractString, AbstractChar and Number element types (got element type Nothing)

Second, when you retrieve values from categorical arrays they are wrapped in
a custom type:

julia> cv[1]
CategoricalValue{String, UInt32} "a"

You can extract out the underlying value using the unwrap function:

julia> unwrap(cv[1])
"a"

Why do we have this behavior? The reason is that values stored in the
categorical array are treated as categories, and the wrapped value is just a
label for some category.

Let us first check the order of levels in the cv vector and try to compare two
of its elements:

julia> levels(cv)
4-element Vector{String}:
 "a"
 "b"
 "c"
 "d"

julia> cv[1] < cv[2]
ERROR: ArgumentError: Unordered CategoricalValue objects cannot be tested for order using <. Use isless instead, or call the ordered! function on the parent array to change this

The order of levels will be reflected e.g. if you use FreqTables.jl freqtable
function on such a vector:

julia> using FreqTables

julia> freqtable(cv)
4-element Named Vector{Int64}
Dim1  │
──────┼────────
"a"   │ 1000000
"b"   │ 1000000
"c"   │ 1000000
"d"   │ 1000000

Let us make the cv vector ordered and change the order of levels:

julia> ordered!(cv, true);

julia> levels!(cv, ["d", "b", "a", "c"]);

Now try doing the same operations:

julia> levels(cv)
4-element Vector{String}:
 "d"
 "b"
 "a"
 "c"

julia> cv[1] < cv[2]
false

julia> cv[1], cv[2]
(CategoricalValue{String, UInt32} "a" (3/4), CategoricalValue{String, UInt32} "b" (2/4))

julia> freqtable(cv)
4-element Named Vector{Int64}
Dim1  │
──────┼────────
"d"   │ 1000000
"b"   │ 1000000
"a"   │ 1000000
"c"   │ 1000000

As you can see freqtable respects the ordering of levels. Also when we compared
cv[1] vs cv[2] the order is respected (i.e. since level "a" is in position
3, and level "b" is in position 2 the comparison returns false).

Conclusions

In summary, most of the time using PooledArrays.jl is what you will need if you
only require to save memory when working with large data. However, if you want
to process data that is categorical in statistical sense use
CategoricalArrays.jl. There is some mental overhead of working with categorical
values as there is a special CategoricalValue type that you need to learn how
to work with. However, the benefit is that the information about levels and
their ordering is retained and can be used if you do operations on elements
of categorical vector.