Re-posted from: https://bkamins.github.io/julialang/2023/12/15/table.html
Introduction
In Julia we have a common interface for working with tabular data. It is provided by the Tables.jl package.
The fact that such an interface is defined greatly simplifies interoperability between packages.
However, it introduces also a challenge. User needs to decide which concrete type of table to use?
From my experience (biased) two most common types of tables used in practice are:
- using a
NamedTuple
; - using a
DataFrame
.
In this post I want to compare them so that you get a guidance which one and when to choose in your projects.
The post was tested under Julia 1.9.2 and DataFrames.jl 1.6.1.
Dependency level
NamedTuple
is a type provided by Base Julia. DataFrame
is defined in DataFrames.jl.
This means that you always have access to NamedTuple
, while for DataFrame
you need
to install DataFrames.jl and later load it.
DataFrames.jl is a relatively big package. Its installation and precompilation takes over 1 minute.
This is maybe not a huge time, but if for some reason your project environment would require
frequent recompilation it can start feeling cumbersome.
The other aspect is package load time:
julia> @time using DataFrames
1.092920 seconds (1.27 M allocations: 78.653 MiB, 6.52% gc time, 0.46% compilation time)
Again, one second is not much, but in some applications users might want to avoid it.
In summary, NamedTuple
wins here as a more lightweight option.
Conformance with Tables.jl interface
DataFrame
is always a Tables.jl table.
NamedTuple
is considered to be a table only if its fields are AbstracVector
.
This limitation introduces an extra level of effort. User needs to ensure and check this property.
Let me give you one example when it is relevant:
julia> DataFrame(a=0, b=[1, 2]) # a table
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 0 1
2 │ 0 2
julia> (a=0, b=[1, 2]) # not a table
(a = 0, b = [1, 2])
julia> Tables.istable((a=0, b=[1,2]))
false
Additionally NamedTuple
does not provide an automatic check if the lengths of all columns match:
julia> Tables.istable((a=[0], b=[1,2]))
true
julia> DataFrame((a=[0], b=[1,2]))
ERROR: DimensionMismatch: column :a has length 1 and column :b has length 2
In summary, DataFrame
wins here as it is safer. With NamedTuple
you need to do additional manual checks if the data you are working with is a table indeed.
Data safety
When creating a DataFrame
it copies data by default (this can be overriden by copycols=false
in the constructor):
julia> x = [1, 2]
2-element Vector{Int64}:
1
2
julia> y = [3, 4]
2-element Vector{Int64}:
3
4
julia> df = DataFrame(; x, y)
2×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 2 4
julia> df.x === x
false
julia> df.y === y
false
This is not a case for a NamedTuple
:
julia> nt = (; x, y)
(x = [1, 2], y = [3, 4])
julia> nt.x === x
true
julia> nt.y === y
true
This design means that DataFrame
is safer. Once you create it you mostly can forget about the potential risks of modifying the source data that was used to create it.
You might think that this is an issue of a minor relevance. However, in practice, not doing a copy lead to many hard-to-find bugs (and that is why we do copy data by default when creating a DataFrame
).
In summary, DataFrame
wins here as it is safer (especially that you can disable safe behavior by passing copycols=false
to the DataFrame
constructor if you wish so).
Flexibility
A big practical difference between NamedTuple
and DataFrame
is that NamedTuple
is immutable. You cannot add, remove, or rename its columns.
On the other hand DataFrame
allows for such operations, which makes it more convenient when you need to manipulate your data.
In the flexibility dimension DataFrame
is a clear winner.
Performance
There is an opposite side of the flexibility coin. DataFrame
is not type stable, while NamedTuple
is.
This means that, if you want performance, you need to either use separate kernel functions or higher-order functions provided by DataFrames.jl (like combine
or select
).
Here is an example of performance unfriendly and performance friendly code for DataFrame
:
julia> function sum1(table)
s = 0
for v in table.x
s += v
end
return s
end
sum1 (generic function with 1 method)
julia> function sum2(table)
function kernel(x)
s = 0
for v in x
s += v
end
return s
end
return kernel(table.x)
end
sum2 (generic function with 1 method)
julia> df = DataFrame(x=1:10^8);
julia> @time sum1(df) # after compilation
5.810585 seconds (400.00 M allocations: 7.451 GiB, 1.78% gc time)
5000000050000000
julia> @time sum2(df) # after compilation
0.041333 seconds (1 allocation: 16 bytes)
5000000050000000
Note that there is no such issue with NamedTuple
:
julia> nt = NamedTuple(pairs(eachcol(df)))
(x = [1, 2, … 99999999, 100000000],)
julia> @time sum1(nt) # after compilation
0.042012 seconds (1 allocation: 16 bytes)
5000000050000000
julia> @time sum2(nt) # after compilation
0.051452 seconds (1 allocation: 16 bytes)
5000000050000000
The winner is NamedTuple
. It is easier to have good performance using it.
Compilation
The downside of NamedTuple
being compiled is that it can take a lot of time to compile a function taking
it (or even to create it) if number of columns is large. DataFrame
does not have such issues.
Here is an example:
julia> @time df = DataFrame(transpose(1:10_000), :auto)
0.008328 seconds (39.53 k allocations: 2.431 MiB)
1×10000 DataFrame
Row │ x1 x2 x3 x4 x5 x6 x ⋯
│ Int64 Int64 Int64 Int64 Int64 Int64 I ⋯
─────┼──────────────────────────────────────────────
1 │ 1 2 3 4 5 6 ⋯
9994 columns omitted
julia> @time nt = NamedTuple(pairs(eachcol(df)));
4.660914 seconds (604.77 k allocations: 27.920 MiB, 0.20% gc time, 99.15% compilation time)
As you can see there was a significant compilation overhead of creation of nt
.
For wide tables DataFrame
is a clearly preferred option.
Display
DataFrame
uses a nicely formatted PrettyTables.jl display. NamedTuple
is not that readable.
Try displying the nt
object I have created in the last section. You will get several pages of
hard-to-read output.
DataFrame
has clearly superior default display mechanism.
Convenience
NamedTuple
is a generic data type, while DataFrame
was designed for working with tabular data specifically.
Therefore DataFrame
provides numerous convenience functionalities that NamedTuple
lacks. Let me give two
examples:
- You can select a column of a
DataFrame
using a string or aSymbol
as its name; withNamedTuple
you have to useSymbol
;
allowing for strings has two big advantages: first it is slightly easier to generate column names as strings programmatically,
second – it is easier to type column names containing special characters, like e.g. whitespace, for instance"some column name"
is inconvenient to work with usingNamedTuple
. - You have convenient column selectors like
Cols
or regular expressions which work withDataFrame
and are not supported by
NamedTuple
.
If you need convenience DataFrame
should be your preference.
Functionality
Last, but not least DataFrame
comes with dozens of convenience functions provided by DataFrames.jl
package. These include split-apply-combine, joining, sorting, subsetting, broadcasting etc. of
DataFrame
objects. None of this is available for NamedTuple
out of the box.
Indeed there are extra packages that work nicely with NamedTuple
, but this means that you need
to install and load them separately (and usually you will need several to get your job done).
Additionally there is a host of convenience packages (like DataFramesMeta.jl or Tidier.jl) that make it easier
to work with DataFrame
objects.
When it comes to functionality DataFrame
is a winner.
Metadata
DataFrame
supports storing table level and column level metadata (attributes in R or labels in Stata are a similar
concept). NamedTuple
does not provide such a functionality.
Therefore, if you want to annotate your data DataFrame
is preferable.
Concluding remarks
Let us summarize our findings:
NamedTuple
wins in: dependency level, performanceDataFrame
wins in: conformance with Tables.jl, data safety, flexibility, compilation, display, convenience, functionality, metadata
Given these considerations I would say that most of the time DataFrame
is a safe default choice for tabular data storage format.
This is especially true for interactive workflows.
However, there are cases, when you will find NamedTuple
preferable. In the Julia world usually performance gets a high priority.
NamedTuple
is preferable here especially if you would have millions of small tables, as in this case the overhead of larger DataFrame
object will be noticeable.