Is vector pooling considered harmful?

By: Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/11/17/pooling.html

Introduction

PooledArrays.jl is a package providing a custom array type that can be used to compress your data when it has only a few unique elements.
In this post I want to explain the design of the PooledArray object and discuss how it affects performance.

The post was written under Julia 1.9.2, PooledArrays.jl 1.4.3, and DataFrames.jl 1.6.1.

How pooling works

The definition of the PooledArray type is the following:

mutable struct PooledArray{T, R<:Integer, N, RA} <: AbstractArray{T, N}
    refs::RA
    pool::Vector{T}
    invpool::Dict{T,R}
    # refcount[] is 1 if only one PooledArray holds a reference to pool and invpool
    refcount::Threads.Atomic{Int}
end

It represents an N-dimensional array with element type T.
Internally, each unique value stored in a PooledArray gets an integer representation (a reference) of type R.
The refs field is then an AbstractArray{R, N} that keeps track of which reference value is stored for each entry of the array.

To be able to work with these reference numbers efficiently, PooledArray stores two additional fields (a small lookup sketch follows this list):

  • pool, which tells how reference numbers map to the actual values of type T stored in the array;
  • invpool, which gives the inverse mapping, i.e., the reference number assigned to a given value of type T.
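
To make this mapping concrete, here is a minimal sketch of a manual lookup using the internal fields described above. It is for illustration only and relies on internals rather than any public API:

using PooledArrays

pa = PooledArray(["a", "b", "a"])
i = 2

# getindex is essentially a reference lookup into the pool:
pa.pool[pa.refs[i]] == pa[i]      # true

# invpool maps a value back to its reference number:
pa.invpool[pa[i]] == pa.refs[i]   # true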

Since many operations on pooled arrays do not change pool and invpool, PooledArray has an extra optimization
that automatically ensures that two arrays that can be proven to have the same pool and invpool share them.
The refcount field keeps track of how many such arrays exist. It is consulted when we modify the
pool and invpool of some array and want to make sure that we do not modify another PooledArray by accident.

Let us check these properties in practice.

Test driving pooled arrays

Let us create a simple PooledArray first:

julia> using PooledArrays

julia> pa1 = PooledArray(["a", "b", "a", "b", "a", "b"])
6-element PooledVector{String, UInt32, Vector{UInt32}}:
 "a"
 "b"
 "a"
 "b"
 "a"
 "b"

julia> pa1.refs
6-element Vector{UInt32}:
 0x00000001
 0x00000002
 0x00000001
 0x00000002
 0x00000001
 0x00000002

julia> pa1.pool
2-element Vector{String}:
 "a"
 "b"

julia> pa1.invpool
Dict{String, UInt32} with 2 entries:
  "b" => 0x00000002
  "a" => 0x00000001

julia> pa1.refcount
Base.Threads.Atomic{Int64}(1)

We can see that, by default, each entry is recorded as a 4-byte UInt32.
Additionally, pa1.refcount tells us that only this pooled array
uses the pool and invpool objects that it references.

Let us first check what happens when we operate on this array:

julia> pa2 = pa1[[1, 2, 3]]
3-element PooledVector{String, UInt32, Vector{UInt32}}:
 "a"
 "b"
 "a"

julia> pa2.refcount
Base.Threads.Atomic{Int64}(2)

julia> pa1.refcount
Base.Threads.Atomic{Int64}(2)

julia> pa2.pool === pa1.pool
true

julia> pa2.invpool === pa1.invpool
true

As you can see, since pa2 was created by subsetting pa1, it was known that they can share their pool and invpool.
The refcount field tells us that two objects now reuse them.

Let us now modify the pool of pa2:

julia> pa2[1] = "c"
"c"

julia> pa2.refcount
Base.Threads.Atomic{Int64}(1)

julia> pa1.refcount
Base.Threads.Atomic{Int64}(1)

julia> pa2.pool
3-element Vector{String}:
 "a"
 "b"
 "c"

julia> pa1.pool
2-element Vector{String}:
 "a"
 "b"

As you can see, the pools were automatically decoupled and refcount was adjusted accordingly.

In summary, the benefit of pool-sharing is that we can subset a PooledArray very quickly, without having to re-create pool and invpool.
This makes working with PooledArray fast as long as we do not change the set of values stored in it (a quick way to see this is sketched below).
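
As a rough illustration, the sketch below compares subsetting a large PooledArray (which reuses the existing pool and invpool) with building a fresh PooledArray from a subset of a plain vector (which has to construct them from scratch). The vector and its size are my own choices and timings will vary by machine, so treat this as an experiment to run rather than a benchmark result:

using PooledArrays

v = repeat(["x$i" for i in 1:1000], 1000)  # 10^6 elements, 1000 unique values
pv = PooledArray(v)

# Subsetting an existing PooledArray only copies refs and reuses the pool:
@time pv[1:500_000]

# Building a PooledArray from a plain-vector subset re-creates pool and invpool:
@time PooledArray(v[1:500_000])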

The second important design aspect of PooledArray is the R type. As I have said, by default it is UInt32. However,
for small pools this is inefficient. Therefore you can write:

julia> pa3 = PooledArray(pa1, compress=true)
6-element PooledVector{String, UInt8, Vector{UInt8}}:
 "a"
 "b"
 "a"
 "b"
 "a"
 "b"

julia> Base.summarysize(pa1)
570

julia> Base.summarysize(pa3)
504

As you can see, you can use the compress=true keyword argument to automatically pick the smallest R type that can hold the pool at hand.
In our case it is UInt8, which would save a lot of memory in the case of large arrays.
Why do we use UInt32 by default then?
The reason is that this type is typically efficient enough memory-wise and at the same time it allows a pool that is large enough in most scenarios.
For example, the limitation of a UInt8 pool is that it can store at most 255 distinct values:

julia> pa4 = PooledArray(1:255, compress=true);

julia> pa4[1]=256
ERROR: You're using a PooledArray with ref type UInt8, which can only hold 255 values,
and you just tried to add the 256th reference.  Please change the ref type
to a larger int type, or use the default ref type (UInt32).

So you have a tradeoff here. If you are sure you will not change your pool, then compress=true is a safe option.
If you know you might need to change the pool, you need to pick the R type more carefully (see the sketch below).
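
If I read the constructor correctly, PooledArrays.jl also accepts the reference type as a second positional argument, which gives a middle ground: pick a type small enough to save memory but large enough for the pool growth you expect. Please treat the exact signature as an assumption to verify against the package documentation:

using PooledArrays

# Request UInt16 references explicitly: up to 65535 distinct values,
# while refs take only 2 bytes per element.
pa = PooledArray(["a", "b", "a", "b"], UInt16)

eltype(pa.refs)   # expected to be UInt16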

What are the benefits of pooled arrays?

There are two kinds of benefits from using a PooledArray: a smaller memory footprint and better performance.
Let me explain them by example.

First create two large vectors storing strings:

julia> v1 = ["x$(isodd(i))" for i in 1:10^6];

julia> v2 = PooledArray(v1, compress=true);

julia> Base.summarysize(v1)
21500040

julia> Base.summarysize(v2)
1000507

As you can see, there is a significant compression gain in our example (roughly 21x less memory) from using a PooledArray.

The second benefit is performance, especially in combination with DataFrames.jl:

julia> using DataFrames

julia> df = DataFrame(; v1, v2)
1000000×2 DataFrame
     Row │ v1      v2
         │ String  String
─────────┼────────────────
       1 │ xtrue   xtrue
       2 │ xfalse  xfalse
       3 │ xtrue   xtrue
    ⋮    │   ⋮       ⋮
  999998 │ xfalse  xfalse
  999999 │ xtrue   xtrue
 1000000 │ xfalse  xfalse
       999994 rows omitted

Now let us perform two example operations (I am measuring the second run of each to avoid counting compilation time).

The first is aggregation:

julia> combine(groupby(df, :v1), nrow);

julia> @time combine(groupby(df, :v1), nrow);
  0.025122 seconds (201 allocations: 31.271 MiB)

julia> combine(groupby(df, :v2), nrow);

julia> @time combine(groupby(df, :v2), nrow);
  0.002766 seconds (227 allocations: 7.643 MiB)

As you can see, aggregation is much faster when grouping by a PooledArray column.

The second example is innerjoin:

julia> df_ref = df[1:2, :]
2×2 DataFrame
 Row │ v1      v2
     │ String  String
─────┼────────────────
   1 │ xtrue   xtrue
   2 │ xfalse  xfalse

julia> df_ref.val = 1:2
1:2

julia> innerjoin(df, df_ref, on=:v1, makeunique=true);

julia> @time innerjoin(df, df_ref, on=:v1, makeunique=true);
  0.057885 seconds (248 allocations: 36.741 MiB, 23.00% gc time)

julia> innerjoin(df, df_ref, on=:v2, makeunique=true);

julia> @time innerjoin(df, df_ref, on=:v2, makeunique=true);
  0.024692 seconds (265 allocations: 43.416 MiB)

And again we see that joining on :v2 is faster than on :v1.

When might using pooled arrays not be a good idea?

There are three cases when you might not see benefits from using PooledArray.

The first is when your data has many unique values. Then you have to pay the price
of storing the refs, pool, and invpool objects, and all of them will be large (a quick check follows below).
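
For instance, when nearly every value is unique, the pool duplicates the data and the invpool dictionary adds further overhead on top of refs. Here is a small check you can run yourself (the names are mine and I omit the exact sizes, as they depend on the Julia version):

using PooledArrays

v_unique = string.(1:10^6)          # essentially all values are unique
pv_unique = PooledArray(v_unique)

# Expect the pooled version to be larger: it stores the pool (all the strings),
# the invpool dictionary, and the refs vector on top of that.
Base.summarysize(v_unique)
Base.summarysize(pv_unique)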

The second is when the values you store have a small memory footprint (e.g. Bool)
and you did not use compress=true. In such a case refs will take more memory than the original data would (see the sketch below).
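
A minimal sketch of this second case, assuming the default UInt32 references: each Bool takes 1 byte, while each reference takes 4 bytes, so the pooled version should be roughly four times larger (plus the small pool and invpool):

using PooledArrays

b = rand(Bool, 10^6)

# Default refs are UInt32 (4 bytes each) vs 1 byte per Bool in the plain vector.
Base.summarysize(b)
Base.summarysize(PooledArray(b))

# Even with compress=true the refs are UInt8 (1 byte each), so at best we only
# match the plain vector while still paying the pool and invpool overhead.
Base.summarysize(PooledArray(b, compress=true))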

The third case is when you create multiple copies of a PooledArray and modify its pool. In such a case
the cost of copying the pool and invpool fields might be non-negligible. Let me show you a practical
example of the third situation:

julia> df2 = DataFrame(id=1:10^6, v=PooledArray(repeat(["x$i" for i in 1:1000], 1000)))
1000000×2 DataFrame
     Row │ id       v
         │ Int64    String
─────────┼─────────────────
       1 │       1  x1
       2 │       2  x2
       3 │       3  x3
    ⋮    │    ⋮       ⋮
  999998 │  999998  x998
  999999 │  999999  x999
 1000000 │ 1000000  x1000
        999994 rows omitted

julia> df3 = DataFrame(id=1:10^6, v=repeat(["x$i" for i in 1:1000], 1000));

Note that df2 and df3 are identical except that in df2 the :v column is pooled and in df3 it is not.

Now let us test outerjoin on this data:

julia> outerjoin(df2, df2, on=:id, makeunique=true);

julia> @time outerjoin(df2, df2, on=:id, makeunique=true);
  0.065559 seconds (326 allocations: 30.951 MiB)

julia> outerjoin(df3, df3, on=:id, makeunique=true);

julia> @time outerjoin(df3, df3, on=:id, makeunique=true);
  0.036927 seconds (274 allocations: 38.400 MiB)

Note that working with the non-pooled data is faster here. If we check innerjoin, this is not the case:

julia> innerjoin(df2, df2, on=:id, makeunique=true);

julia> @time innerjoin(df2, df2, on=:id, makeunique=true);
  0.018188 seconds (210 allocations: 30.528 MiB)

julia> innerjoin(df3, df3, on=:id, makeunique=true);

julia> @time innerjoin(df3, df3, on=:id, makeunique=true);
  0.029364 seconds (206 allocations: 38.157 MiB)

What is going on here? Let us look at the output:

julia> outerjoin(df2, df2, on=:id, makeunique=true)
1000000×3 DataFrame
     Row │ id       v        v_1
         │ Int64    String?  String?
─────────┼───────────────────────────
       1 │       1  x1       x1
       2 │       2  x2       x2
       3 │       3  x3       x3
    ⋮    │    ⋮        ⋮        ⋮
  999998 │  999998  x998     x998
  999999 │  999999  x999     x999
 1000000 │ 1000000  x1000    x1000
                  999994 rows omitted

julia> innerjoin(df2, df2, on=:id, makeunique=true)
1000000×3 DataFrame
     Row │ id       v       v_1
         │ Int64    String  String
─────────┼─────────────────────────
       1 │       1  x1      x1
       2 │       2  x2      x2
       3 │       3  x3      x3
    ⋮    │    ⋮       ⋮       ⋮
  999998 │  999998  x998    x998
  999999 │  999999  x999    x999
 1000000 │ 1000000  x1000   x1000
                999994 rows omitted

Note that outerjoin changes the element type of the v and v_1 columns (to allow missing values), so pool and invpool need to be re-created twice,
which takes time and memory.
With innerjoin the element type is not changed, so pool and invpool in the output are reused from the source data frame (this reuse can be checked directly, as sketched below).
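
If you want to confirm this reuse yourself, a quick internals-based check is to compare the pool objects by identity. This relies on the internal fields discussed earlier and on the behavior described above, so treat it as an illustration rather than a guaranteed API:

# Given the reuse described above, the innerjoin output column is expected
# to share its pool with the source column:
innerjoin(df2, df2, on=:id, makeunique=true).v.pool === df2.v.pool

# The outerjoin output column allows missing values, so its element type changes
# and a new pool had to be created:
eltype(outerjoin(df2, df2, on=:id, makeunique=true).v)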

Conclusions

In summary:

  • PooledArray is useful if you have data that has many duplicates.
  • The benefits of using PooledArray are lower memory footprint and the fact that some operations on it can be faster (e.g. groupby in DataFrames.jl).
  • The biggest benefit of PooledArray is when you do not change its pool of values. In such a case the pool and invpool objects are created only once
    and are reused in arrays derived from the source array.
  • Remember to carefully choose the reference type used in a PooledArray. By default it is UInt32, but you can pick a smaller type to get even better compression
    at the expense of a smaller number of unique values that your pooled array can store.

I hope you find the tips shared in this post useful in your data analysis processes.