Re-posted from: https://bkamins.github.io/julialang/2023/09/01/tables.html
Introduction
Tables.jl is a fundamental package in the JuliaData ecosystem.
One of the key concepts used in Tables.jl is a schema of a table.
Schema is information about the column names and types of a table.
Having access to schema is useful as being able to query these properties
is constantly needed in practice.
In this post I want to discuss in what cases a table might not have
a schema information associated with it and how Tables.jl handles this case.
The post was written using Julia 1.9.2, Tables.jl 1.10.1, and DataFrames.jl 1.6.1.
What is a schema of a table?
Let us create some example tables and investigate their schema:
julia> df = DataFrame(a=[1, 2], b=[3.0, 4.0], c=[1, "a"])
2×3 DataFrame
Row │ a b c
│ Int64 Float64 Any
─────┼─────────────────────
1 │ 1 3.0 1
2 │ 2 4.0 a
julia> Tables.schema(df)
Tables.Schema:
:a Int64
:b Float64
:c Any
julia> ntv = Tables.columntable(df)
(a = [1, 2], b = [3.0, 4.0], c = Any[1, "a"])
julia> Tables.schema(ntv)
Tables.Schema:
:a Int64
:b Float64
:c Any
julia> vnt = Tables.rowtable(df)
2-element Vector{NamedTuple{(:a, :b, :c), Tuple{Int64, Float64, Any}}}:
NamedTuple{(:a, :b, :c), Tuple{Int64, Float64, Any}}((1, 3.0, 1))
NamedTuple{(:a, :b, :c), Tuple{Int64, Float64, Any}}((2, 4.0, "a"))
julia> Tables.schema(vnt)
Tables.Schema:
:a Int64
:b Float64
:c Any
I have created a data frame df
with three columns and then converted it into
a named tuple of vectors ntv
and a vector of named tuples vnt
.
All these three objects are Tables.jl tables.
In all cases we have checked that the Tables.schema
properly identifies the schema
of the se tables. They have three columns :a
, :b
, and :c
, and these columns
have types :Int64
, Float64
, and Any
.
So in what case a table might not have a schema? To understand this consider the
following transformation:
julia> vnt2 = [(; collect(pairs(r))...) for r in vnt]
2-element Vector{NamedTuple{(:a, :b, :c)}}:
(a = 1, b = 3.0, c = 1)
(a = 2, b = 4.0, c = "a")
julia> Tables.schema(vnt2)
julia> isnothing(Tables.schema(vnt2))
true
Let us understand what happens in this code. With
(; collect(pairs(r))...)
operation I make the type of the fields
of the named tuples stored in the vnt
vector concrete.
To understand this better note:
julia> typeof(vnt[1])
NamedTuple{(:a, :b, :c), Tuple{Int64, Float64, Any}}
julia> typeof((; collect(pairs(vnt[1]))...))
NamedTuple{(:a, :b, :c), Tuple{Int64, Float64, Int64}}
Note that the type of the :c
field is Any
and Int64
respectively.
This transformation means that the vnt2
vector itself does not, in consequence,
have a concrete element type:
julia> eltype(vnt2)
NamedTuple{(:a, :b, :c)}
This type only specifies column names, but not their types. In such a situation
the Tables.schema
function returns nothing
signaling to the user that the schema
of a table is unknown.
You might think that this is an artificial example but consider the following common case:
julia> vnt3 = [(a=1, b=2), (a=missing, b=3)]
2-element Vector{NamedTuple{(:a, :b)}}:
(a = 1, b = 2)
(a = missing, b = 3)
julia> Tables.schema(vnt3)
julia> isnothing(Tables.schema(vnt3))
true
As you can see when we have a vector of named tuples, a mix in of missing
values
makes it schema-less.
How does Tables.jl handle schema-less tables? Part 1
If Tables.jl detects a schema-less table it tries to dynamically detect its schema
when functions defined in Tables.jl are called. The three key functions here are:
Tables.columns
: returns an object that can be queried column-wise;Tables.rows
: returns an object that can be queried row-wise;Tables.dictrowtable
: returns a row-wise vector that performs “column unioning”.
You might wonder what this “column unioning” part means. We will get to it in a second.
First let us investigate how these functions work on our vnt2
object that is schema-less:
julia> Tables.columns(vnt2)
Tables.CopiedColumns{NamedTuple{(:a, :b, :c), Tuple{Vector{Int64}, Vector{Float64}, Vector{Any}}}} with 2 rows, 3 columns, and schema:
:a Int64
:b Float64
:c Any
julia> Tables.schema(Tables.columns(vnt2))
Tables.Schema:
:a Int64
:b Float64
:c Any
julia> Tables.rows(vnt2)
2-element Vector{NamedTuple{(:a, :b, :c)}}:
(a = 1, b = 3.0, c = 1)
(a = 2, b = 4.0, c = "a")
julia> Tables.schema(Tables.rows(vnt2))
julia> isnothing(Tables.schema(Tables.rows(vnt2)))
true
julia> Tables.dictrowtable(vnt2)
Tables.DictRowTable([:a, :b, :c], Dict{Symbol, Type}(:a => Int64, :b => Float64, :c => Union{Int64, String}), Dict{Symbol, Any}[Dict(:a => 1, :b => 3.0, :c => 1), Dict(:a => 2, :b => 4.0, :c => "a")])
julia> Tables.schema(Tables.dictrowtable(vnt2))
Tables.Schema:
:a Int64
:b Float64
:c Union{Int64, String}
As you can see Tables.columns
performed eltype detection and determined the type of column :c
as Any
.
The Tables.rows
produces a schema-less object as for vnt2
object calling Tables.rows
on it just returns the input.
Finally Tables.dictrowtable
performs a narrower eltype for column :c
which is Union{Int64, String}
.
All these results are technically correct, but it is important to note that they start to matter when we create
a data frame from the result:
julia> DataFrame(Tables.columns(vnt2))
2×3 DataFrame
Row │ a b c
│ Int64 Float64 Any
─────┼─────────────────────
1 │ 1 3.0 1
2 │ 2 4.0 a
julia> DataFrame(vnt2)
2×3 DataFrame
Row │ a b c
│ Int64 Float64 Any
─────┼─────────────────────
1 │ 1 3.0 1
2 │ 2 4.0 a
julia> DataFrame(Tables.dictrowtable(vnt2))
2×3 DataFrame
Row │ a b c
│ Int64 Float64 Union…
─────┼────────────────────────
1 │ 1 3.0 1
2 │ 2 4.0 a
In this case for Tables.columns
and Tables.rows
we have Any
as element type for :c
.
While for Tables.dictrowtable
we get the Union
. This difference might matter to you
if you wanted to later change data stored in :c
column.
However, the key difference between Tables.dictrowtable
and other methods is visible
in case when we have heterogeneous column list data.
How does Tables.jl handle schema-less tables? Part 2
What is the data that has a heterogeneous column list?
It is relatively common in practice. Let me create two examples:
julia> h1 = [(; a=1), (; a=2, b=3)]
2-element Vector{NamedTuple}:
(a = 1,)
(a = 2, b = 3)
julia> h2 = [(; a=1), (; b=3)]
2-element Vector{NamedTuple{names, Tuple{Int64}} where names}:
(a = 1,)
(b = 3,)
Both h1
and h2
have a different set of columns. How does the
Tables.jl handle them? Let us check by creating a data frame from them:
julia> DataFrame(Tables.columns(h1))
2×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
julia> DataFrame(Tables.rows(h1))
2×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
julia> DataFrame(Tables.dictrowtable(h1))
2×2 DataFrame
Row │ a b
│ Int64 Int64?
─────┼────────────────
1 │ 1 missing
2 │ 2 3
For the h1
input we see that using Tables.columns
and Tables.rows
drops the :b
column while Tables.dictrowtable
keeps it and fills missing entries with missing
.
This behavior is called “column unioning”.
Now let us check the h2
case:
julia> DataFrame(Tables.columns(h2))
ERROR: type NamedTuple has no field a
julia> DataFrame(Tables.rows(h2))
ERROR: type NamedTuple has no field a
julia> DataFrame(Tables.dictrowtable(h2))
2×2 DataFrame
Row │ a b
│ Int64? Int64?
─────┼──────────────────
1 │ 1 missing
2 │ missing 3
Now we have an even stranger behavior. The first two calls error, while the
Tables.dictrowtable
works by “column unioning”. Why do the first two cases error?
The reason is that DataFrame
constructor internally always calls Tables.columns
on a passed table. And Tables.columns
when doing column detection assumes that
the list of columns for row-wise table is given by its first row. So even
without calling DataFrame
we get an error when we do:
julia> Tables.columns(h2)
ERROR: type NamedTuple has no field a
In summary we have two cases of behaviors for row-wise tables:
- assume the columns are given in the first row of a table;
this is whatTables.columns
does, and, in consequenceDataFrame
constructor; - perform “column unioning”; this is what
Tables.dictrowtable
does.
Conclusions
The take-away from the examples we have given today are as follows:
- by default
DataFrame(row_wise_table)
inherits theTables.columns
behavior and assumes
that the set of columns of the passed row-wise table is given by its first row; - if you want to override the default behavior, and want to perform “column unioning” instead
callDataFrame(Tables.dictrowtable(row_wise_table))
; this option is useful if you have
data that has heterogeneous column lists in different rows.
The examples given today were a bit low-level, but I hope they improved your understanding
how the Tables.jl functions handle different tabular data sources.