Re-posted from: https://bkamins.github.io/julialang/2023/10/06/tables.html
Introduction
Today I want to continue exploration of Tables.jl package functionality.
Some time ago I wrote a post about getting a schema of Tables.jl tables.
In the post I discussed, in particular, “column unioning”.
The column unioning approach allows you to put data with heterogenous entries into a single table
by filling the missing entries with missing
values.
In my previous post I discussed the Tables.dictrowtable
function that performs column unioning.
Today I want to introduce you to the Tables.dictcolumntable
function that has a similar functionality,
but uses a different storage format.
The post was written using Julia 1.9.2, Tables.jl 1.11.0, and DataFrames.jl 1.6.1.
Introduction: an example of column unioning
Let us start with a simple example of column unioning behavior:
julia> using DataFrames
julia> vnt = [(a=1, b=2), (a=3, c=4), (b=5, d=6)]
3-element Vector{NamedTuple{names, Tuple{Int64, Int64}} where names}:
(a = 1, b = 2)
(a = 3, c = 4)
(b = 5, d = 6)
julia> DataFrame(Tables.dictrowtable(vnt))
3×4 DataFrame
Row │ a b c d
│ Int64? Int64? Int64? Int64?
─────┼────────────────────────────────────
1 │ 1 2 missing missing
2 │ 3 missing 4 missing
3 │ missing 5 missing 6
julia> DataFrame(Tables.dictcolumntable(vnt))
3×4 DataFrame
Row │ a b c d
│ Int64? Int64? Int64? Int64?
─────┼────────────────────────────────────
1 │ 1 2 missing missing
2 │ 3 missing 4 missing
3 │ missing 5 missing 6
We have a heterogeneous table vnt
(a vector of named tuples).
Each of its rows has a different set of columns.
After column unioning operation we get a table (displayed as a data frame in our example) where each row has four columns and the missing values are filled with missing
.
Such situations are quite common when one e.g. processes JSON data and each object may have a varying set of attributes.
Why do we have two functions that perform column unioning?
A natural question to ask is why do we have Tables.dictrowtable
and Tables.dictcolumntable
that perform the same operation?
The reason is simple. Tables.dictrowtable
returns a row-oriented object, while Tables.dictcolumntable
a column oriented one.
We can easily check it by digging into the internals of these objects:
julia> getfield(Tables.dictrowtable(vnt), :values)
3-element Vector{Dict{Symbol, Any}}:
Dict(:a => 1, :b => 2)
Dict(:a => 3, :c => 4)
Dict(:b => 5, :d => 6)
julia> getfield(Tables.dictcolumntable(vnt), :values)
OrderedCollections.OrderedDict{Symbol, AbstractVector} with 4 entries:
:a => Union{Missing, Int64}[1, 3, missing]
:b => Union{Missing, Int64}[2, missing, 5]
:c => Union{Missing, Int64}[missing, 4, missing]
:d => Union{Missing, Int64}[missing, missing, 6]
As you can see Tables.dictrowtable
internally stores a vector of dictionaries, while Tables.dictcolumntable
a dictionary of vectors.
So the question is when you should use which? A simple answer is that most of the time Tables.dictcolumntable
is what you want,
unless you want to process data row-by-row and the data is very sparse. The reason is that creating a Dict
for each row of the data
has a big overhead. Additionally, most often analytical routines expect data stored in columns. For example DataFrame
has a columnar storage.
Therefore, creating a data frame from a Tables.dictcolumntable
object will be much faster.
Let us make a small test (I repeat the same @time
call several times to capture the variability of the results and make sure all gets compiled):
julia> vnt2 = repeat(vnt, 10^6);
julia> @time DataFrame(Tables.dictrowtable(vnt2));
9.625842 seconds (99.00 M allocations: 5.018 GiB, 20.14% gc time)
julia> @time DataFrame(Tables.dictrowtable(vnt2));
9.792331 seconds (99.00 M allocations: 5.018 GiB, 24.41% gc time)
julia> @time DataFrame(Tables.dictcolumntable(vnt2));
2.029009 seconds (30.00 M allocations: 892.629 MiB)
julia> @time DataFrame(Tables.dictcolumntable(vnt2));
2.251707 seconds (30.00 M allocations: 892.629 MiB)
Indeed we have a faster execution time, less allocations, and less memory allocated with Tables.dictcolumntable
.
If we wanted to squeeze out more performance we could pass copycols=false
to DataFrame
constructor since we know
that in this case we can safely skip making copies of the vectors representing columns that are passed to the constructor:
julia> @time DataFrame(Tables.dictcolumntable(vnt2), copycols=false);
2.584373 seconds (30.02 M allocations: 790.861 MiB, 17.83% gc time, 3.40% compilation time)
julia> @time DataFrame(Tables.dictcolumntable(vnt2), copycols=false);
2.139848 seconds (30.00 M allocations: 789.632 MiB)
julia> @time DataFrame(Tables.dictcolumntable(vnt2), copycols=false);
1.959573 seconds (30.00 M allocations: 789.632 MiB)
Indeed we get less allocations, but the timing of the operation is only minimally better.
Conclusions
I dedicated my post today only to two functions Tables.dictrowtable
and Tables.dictcolumntable
.
The reason is that they are not commonly known, and at the same time they are very useful if you
need column unioning behavior when working with your source data. Happy hacking!