Column unioning in Tables.jl: row vs column oriented storage

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/10/06/tables.html

Introduction

Today I want to continue exploration of Tables.jl package functionality.
Some time ago I wrote a post about getting a schema of Tables.jl tables.
In the post I discussed, in particular, “column unioning”.

The column unioning approach allows you to put data with heterogenous entries into a single table
by filling the missing entries with missing values.

In my previous post I discussed the Tables.dictrowtable function that performs column unioning.
Today I want to introduce you to the Tables.dictcolumntable function that has a similar functionality,
but uses a different storage format.

The post was written using Julia 1.9.2, Tables.jl 1.11.0, and DataFrames.jl 1.6.1.

Introduction: an example of column unioning

Let us start with a simple example of column unioning behavior:

julia> using DataFrames

julia> vnt = [(a=1, b=2), (a=3, c=4), (b=5, d=6)]
3-element Vector{NamedTuple{names, Tuple{Int64, Int64}} where names}:
 (a = 1, b = 2)
 (a = 3, c = 4)
 (b = 5, d = 6)

julia> DataFrame(Tables.dictrowtable(vnt))
3×4 DataFrame
 Row │ a        b        c        d
     │ Int64?   Int64?   Int64?   Int64?
─────┼────────────────────────────────────
   1 │       1        2  missing  missing
   2 │       3  missing        4  missing
   3 │ missing        5  missing        6

julia> DataFrame(Tables.dictcolumntable(vnt))
3×4 DataFrame
 Row │ a        b        c        d
     │ Int64?   Int64?   Int64?   Int64?
─────┼────────────────────────────────────
   1 │       1        2  missing  missing
   2 │       3  missing        4  missing
   3 │ missing        5  missing        6

We have a heterogeneous table vnt (a vector of named tuples).
Each of its rows has a different set of columns.
After column unioning operation we get a table (displayed as a data frame in our example) where each row has four columns and the missing values are filled with missing.

Such situations are quite common when one e.g. processes JSON data and each object may have a varying set of attributes.

Why do we have two functions that perform column unioning?

A natural question to ask is why do we have Tables.dictrowtable and Tables.dictcolumntable that perform the same operation?

The reason is simple. Tables.dictrowtable returns a row-oriented object, while Tables.dictcolumntable a column oriented one.

We can easily check it by digging into the internals of these objects:

julia> getfield(Tables.dictrowtable(vnt), :values)
3-element Vector{Dict{Symbol, Any}}:
 Dict(:a => 1, :b => 2)
 Dict(:a => 3, :c => 4)
 Dict(:b => 5, :d => 6)

julia> getfield(Tables.dictcolumntable(vnt), :values)
OrderedCollections.OrderedDict{Symbol, AbstractVector} with 4 entries:
  :a => Union{Missing, Int64}[1, 3, missing]
  :b => Union{Missing, Int64}[2, missing, 5]
  :c => Union{Missing, Int64}[missing, 4, missing]
  :d => Union{Missing, Int64}[missing, missing, 6]

As you can see Tables.dictrowtable internally stores a vector of dictionaries, while Tables.dictcolumntable a dictionary of vectors.

So the question is when you should use which? A simple answer is that most of the time Tables.dictcolumntable is what you want,
unless you want to process data row-by-row and the data is very sparse. The reason is that creating a Dict for each row of the data
has a big overhead. Additionally, most often analytical routines expect data stored in columns. For example DataFrame has a columnar storage.
Therefore, creating a data frame from a Tables.dictcolumntable object will be much faster.

Let us make a small test (I repeat the same @time call several times to capture the variability of the results and make sure all gets compiled):

julia> vnt2 = repeat(vnt, 10^6);

julia> @time DataFrame(Tables.dictrowtable(vnt2));
  9.625842 seconds (99.00 M allocations: 5.018 GiB, 20.14% gc time)

julia> @time DataFrame(Tables.dictrowtable(vnt2));
  9.792331 seconds (99.00 M allocations: 5.018 GiB, 24.41% gc time)

julia> @time DataFrame(Tables.dictcolumntable(vnt2));
  2.029009 seconds (30.00 M allocations: 892.629 MiB)

julia> @time DataFrame(Tables.dictcolumntable(vnt2));
  2.251707 seconds (30.00 M allocations: 892.629 MiB)

Indeed we have a faster execution time, less allocations, and less memory allocated with Tables.dictcolumntable.

If we wanted to squeeze out more performance we could pass copycols=false to DataFrame constructor since we know
that in this case we can safely skip making copies of the vectors representing columns that are passed to the constructor:

julia> @time DataFrame(Tables.dictcolumntable(vnt2), copycols=false);
  2.584373 seconds (30.02 M allocations: 790.861 MiB, 17.83% gc time, 3.40% compilation time)

julia> @time DataFrame(Tables.dictcolumntable(vnt2), copycols=false);
  2.139848 seconds (30.00 M allocations: 789.632 MiB)

julia> @time DataFrame(Tables.dictcolumntable(vnt2), copycols=false);
  1.959573 seconds (30.00 M allocations: 789.632 MiB)

Indeed we get less allocations, but the timing of the operation is only minimally better.

Conclusions

I dedicated my post today only to two functions Tables.dictrowtable and Tables.dictcolumntable.
The reason is that they are not commonly known, and at the same time they are very useful if you
need column unioning behavior when working with your source data. Happy hacking!