By: Blog by Bogumił Kamiński
Re-posted from: https://bkamins.github.io/julialang/2023/11/17/pooling.html
Introduction
PooledArrays.jl is a package providing a custom array type that can be used for compression of your data if it has a few unique elements.
In this post I want to explain the design of PooledArray
object and discuss how it affects its performance.
The post was written under Julia 1.9.2, PooledArrays.jl 1.4.3, and DataFrames.jl 1.6.1.
How pooling works
The definition of the PooledArray
type is the following:
mutable struct PooledArray{T, R<:Integer, N, RA} <: AbstractArray{T, N}
refs::RA
pool::Vector{T}
invpool::Dict{T,R}
# refcount[] is 1 if only one PooledArray holds a reference to pool and invpool
refcount::Threads.Atomic{Int}
end
It represents a N
-dimensional array with element type T
.
The internal representation of data is that each unique value in a PooledArray
gets an integer
representation (reference) of type R
. Then for each element of the array the refs
field is an array
that is AbstractArray{R, N}
and keeps track of what reference value is stored for each entry of an array.
To be able to efficiently work with these reference numbers PooledArray
stores two fields:
pool
that gives information how reference numbers are mapped to actual values of type T
stored in the array;
invpool
that gives an inverse information on what is a reference number of some value of type T
.
Since many operations on pooled arrays do not change pool
and invpool
the PooledArray
has an extra optimization
that automatically ensures that two arrays that can be proven to have the same pool
and invpool
share them.
The refcount
field is used to keep track how many such arrays exist. This is used in case when we modify
pool
and invpool
of some array and want to make sure that we do not modify another PooledArray
by accident.
Let us check these properties in practice.
Test driving pooled arrays
Let us create a simple PooledArray
first:
julia> using PooledArrays
julia> pa1 = PooledArray(["a", "b", "a", "b", "a", "b"])
6-element PooledVector{String, UInt32, Vector{UInt32}}:
"a"
"b"
"a"
"b"
"a"
"b"
julia> pa1.refs
6-element Vector{UInt32}:
0x00000001
0x00000002
0x00000001
0x00000002
0x00000001
0x00000002
julia> pa1.pool
2-element Vector{String}:
"a"
"b"
julia> pa1.invpool
Dict{String, UInt32} with 2 entries:
"b" => 0x00000002
"a" => 0x00000001
julia> pa1.refcount
Base.Threads.Atomic{Int64}(1)
We can see that, by default, each entry is recorded as 4-byte UInt32
.
Additionally the pa1.refcount
tells us that only this pooled array
uses the pool
and refpool
objects that it references to.
Let us first check what happens when we operate on this array:
julia> pa2 = pa1[[1, 2, 3]]
3-element PooledVector{String, UInt32, Vector{UInt32}}:
"a"
"b"
"a"
julia> pa2.refcount
Base.Threads.Atomic{Int64}(2)
julia> pa1.refcount
Base.Threads.Atomic{Int64}(2)
julia> pa2.pool === pa1.pool
true
julia> pa2.invpool === pa1.invpool
true
As you can see, since pa2
subsets pa1
we knew that they can share their pool
and invpool
.
The refcount
field tells us that two objects reuse them.
Let us now modify the pool of pa2
:
julia> pa2[1] = "c"
"c"
julia> pa2.refcount
Base.Threads.Atomic{Int64}(1)
julia> pa1.refcount
Base.Threads.Atomic{Int64}(1)
julia> pa2.pool
3-element Vector{String}:
"a"
"b"
"c"
julia> pa1.pool
2-element Vector{String}:
"a"
"b"
As you can see the pools got automatically decoupled and refcount
is adjusted accordingly.
In summary, the benefit of pool-sharing is that we can very fast subset PooledArrays
without having to re-create pool
and invpool
.
This makes working with PooledArray
fast as long as we do not change the set of values we store in them.
The second important design aspect of PooledArray
is the R
type. As I have said, by default it is UInt32
. However,
for small pools this is inefficient. Therefore you can write:
julia> pa3 = PooledArray(pa1, compress=true)
6-element PooledVector{String, UInt8, Vector{UInt8}}:
"a"
"b"
"a"
"b"
"a"
"b"
julia> Base.summarysize(pa1)
570
julia> Base.summarysize(pa3)
504
As you can see, you can use the compress=true
keyword argument to automatically pick the minimal size of R
type that is able to keep the pool at hand.
In our case it is UInt8
, which would save a lot of memory in case of large arrays.
Why do we use UInt32
by default then?
The reason is that this type is typically efficient enough memory-wise and at the same time it ensures a pool that is large enough in most scenarios.
For example, the limitation of the UInt8
pool is that it can store up to 255 values only:
julia> pa4 = PooledArray(1:255, compress=true);
julia> pa4[1]=256
ERROR: You're using a PooledArray with ref type UInt8, which can only hold 255 values,
and you just tried to add the 256th reference. Please change the ref type
to a larger int type, or use the default ref type (UInt32).
So you have a tradeoff here. If you are sure you will not change your pool then compress=true
is a safe option.
If you know you might need to change the pool you need to pick the R
type more carefully.
What are the benefits of pooled arrays?
There are two types of benefits of PooledArray
. The first is memory footprint, the second is performance.
Let me explain them by example.
First create two large vectors storing strings:
julia> v1 = ["x$(isodd(i))" for i in 1:10^6];
julia> v2 = PooledArray(v1, compress=true);
julia> Base.summarysize(v1)
21500040
julia> Base.summarysize(v2)
1000507
As you can see there is a significant compression gain in our example by using a PooledArray
.
The second benefit is performance, especially in combination with DataFrames.jl:
julia> using DataFrames
julia> df = DataFrame(; v1, v2)
1000000×2 DataFrame
Row │ v1 v2
│ String String
─────────┼────────────────
1 │ xtrue xtrue
2 │ xfalse xfalse
3 │ xtrue xtrue
⋮ │ ⋮ ⋮
999998 │ xfalse xfalse
999999 │ xtrue xtrue
1000000 │ xfalse xfalse
999994 rows omitted
Now let us perform two example operations (I am measuring the second timing to avoid counting compilation time).
The first is aggregation:
julia> combine(groupby(df, :v1), nrow);
julia> @time combine(groupby(df, :v1), nrow);
0.025122 seconds (201 allocations: 31.271 MiB)
julia> combine(groupby(df, :v2), nrow);
julia> @time combine(groupby(df, :v2), nrow);
0.002766 seconds (227 allocations: 7.643 MiB)
As you can see doing aggregation when grouping by PooledArray
is much faster.
The second example is innerjoin
:
julia> df_ref = df[1:2, :]
2×2 DataFrame
Row │ v1 v2
│ String String
─────┼────────────────
1 │ xtrue xtrue
2 │ xfalse xfalse
julia> df_ref.val = 1:2
1:2
julia> innerjoin(df, df_ref, on=:v1, makeunique=true);
julia> @time innerjoin(df, df_ref, on=:v1, makeunique=true);
0.057885 seconds (248 allocations: 36.741 MiB, 23.00% gc time)
julia> innerjoin(df, df_ref, on=:v2, makeunique=true);
julia> @time innerjoin(df, df_ref, on=:v2, makeunique=true);
0.024692 seconds (265 allocations: 43.416 MiB)
And again we see that joining on :v2
is faster than on :v1
.
When using pooled arrays might not be a good idea?
There are three cases when you might not see benefits from using PooledArray
.
The first is when you have many unique values in your data. Then you have to pay the price
of storing refs
, pool
, and invpool
objects and all of them will be large.
The second is if you have a value that you store that has a small memory footprint, e.g. Bool
and you did not use compress=true
. In such a case refs
will take more memory than original data would.
The third case is when you create multiple copies of a PooledArray
and modify its pool. In such a case
the cost of copying of the pool
and invpool
fields might be non-negligible. Let me show you a practical
example of the third situation:
julia> df2 = DataFrame(id=1:10^6, v=PooledArray(repeat(["x$i" for i in 1:1000], 1000)))
1000000×2 DataFrame
Row │ id v
│ Int64 String
─────────┼─────────────────
1 │ 1 x1
2 │ 2 x2
3 │ 3 x3
⋮ │ ⋮ ⋮
999998 │ 999998 x998
999999 │ 999999 x999
1000000 │ 1000000 x1000
999994 rows omitted
julia> df3 = DataFrame(id=1:10^6, v=repeat(["x$i" for i in 1:1000], 1000));
Note that df2
and df3
are identical except that in df2
the :v
column is pooled and in df3
it is not.
Now let us test outerjoin
on this data:
julia> outerjoin(df2, df2, on=:id, makeunique=true);
julia> @time outerjoin(df2, df2, on=:id, makeunique=true);
0.065559 seconds (326 allocations: 30.951 MiB)
julia> outerjoin(df3, df3, on=:id, makeunique=true);
julia> @time outerjoin(df3, df3, on=:id, makeunique=true);
0.036927 seconds (274 allocations: 38.400 MiB)
Note that working with non-pooled data is faster. If we check innerjoin
this is not the case:
julia> innerjoin(df2, df2, on=:id, makeunique=true);
julia> @time innerjoin(df2, df2, on=:id, makeunique=true);
0.018188 seconds (210 allocations: 30.528 MiB)
julia> innerjoin(df3, df3, on=:id, makeunique=true);
julia> @time innerjoin(df3, df3, on=:id, makeunique=true);
0.029364 seconds (206 allocations: 38.157 MiB)
What is going on here? Let us look at the output:
julia> outerjoin(df2, df2, on=:id, makeunique=true)
1000000×3 DataFrame
Row │ id v v_1
│ Int64 String? String?
─────────┼───────────────────────────
1 │ 1 x1 x1
2 │ 2 x2 x2
3 │ 3 x3 x3
⋮ │ ⋮ ⋮ ⋮
999998 │ 999998 x998 x998
999999 │ 999999 x999 x999
1000000 │ 1000000 x1000 x1000
999994 rows omitted
julia> innerjoin(df2, df2, on=:id, makeunique=true)
1000000×3 DataFrame
Row │ id v v_1
│ Int64 String String
─────────┼─────────────────────────
1 │ 1 x1 x1
2 │ 2 x2 x2
3 │ 3 x3 x3
⋮ │ ⋮ ⋮ ⋮
999998 │ 999998 x998 x998
999999 │ 999999 x999 x999
1000000 │ 1000000 x1000 x1000
999994 rows omitted
Note that outerjoin
changes the element type of v
and v_1
columns, so pool
and invpool
need to be re-created twice,
which takes time and memory.
In innerjoin
the element type is not changed so pool
and invpool
in the output are reused from the source data frame.
Conclusions
In summary:
PooledArray
is useful if you have data that has many duplicates.
- The benefits of using
PooledArray
are lower memory footprint and the fact that some operations on it can be faster (e.g. groupby
in DataFrames.jl).
- The biggest benefit of
PooledArray
is when you do not change its pool of values. In such a case the pool
and invpool
objects are created only once
and are reused in arrays derived from the source array.
- Remember to carefully choose the type of the reference used in
PooledArray
. By default it is UInt32
, but you can pick a smaller type to get even better compression
at the expense of smaller number of unique values that your pooled array can store.
I hope you find the tips shared in this post useful in your data analysis processes.