Working with matrices in DataFrames.jl 1.0

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/04/16/arrays.html

Introduction

DataFrames.jl will have a 1.0 release in a few days. In this post I
want to comment on an issue that I expect might cause the most legacy code
breakage with this release.

The post is tested under Julia 1.6, OrdinaryDiffEq 5.52.3, and
DataFrames.jl 0.22.7 (but I also discuss what changes in 1.0 release).

The past

In the past all matrices could be converted to a DataFrame like this:

~$ julia --banner=no
(@v1.6) pkg> st DataFrames
      Status `~/.julia/environments/v1.6/Project.toml`
  [a93c6f00] DataFrames v0.22.7

julia> using DataFrames

julia> mat = [1 2; 3 4]
2×2 Matrix{Int64}:
 1  2
 3  4

julia> DataFrame(mat)
2×2 DataFrame
 Row │ x1     x2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2
   2 │     3      4

julia> exit()

However, unfortunately this output is deceptive. Try this:

~$ julia --banner=no --depwarn=error
(@v1.6) pkg> st DataFrames
      Status `~/.julia/environments/v1.6/Project.toml`
  [a93c6f00] DataFrames v0.22.7

julia> using DataFrames

julia> mat = [1 2; 3 4]
2×2 Matrix{Int64}:
 1  2
 3  4

julia> DataFrame(mat)
ERROR: `DataFrame(columns::AbstractMatrix)` is deprecated, use `DataFrame(columns, :auto)` instead.

julia> exit()

As you can see DataFrame(mat) is deprecated. The problem is that
Julia 1.6 is hiding deprecation warnings by default. Under
DataFrames.jl 1.0 DataFrame(mat) will error always.

Why is DataFrame(mat) not allowed?

The reason why we decided to disallow DataFrame constructor with a single
Matrix argument is that DataFrames.jl now follows the rule:

If DataFrame constructor is passed a single positional argument this argument
must be a table that is following Tables.jl API.

The point is that Matrix does not support this API.

A person knowing the DataFrames.jl API more thoroughly might point out that the
statement above is not true, and DataFrames.jl allows for some exceptions where
a non-table is accepted in a constructor. This is indeed the case. Here are
the offending cases:

~$ julia --banner=no --depwarn=error
(@v1.6) pkg> st DataFrames
      Status `~/.julia/environments/v1.6/Project.toml`
  [a93c6f00] DataFrames v0.22.7

julia> using DataFrames

julia> DataFrame(:a => 1) # a pair
1×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1

julia> df = DataFrame([:a => 1]) # a vector of pairs
1×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1

julia> dfr = df[1, :]
DataFrameRow
 Row │ a
     │ Int64
─────┼───────
   1 │     1

julia> DataFrame(dfr) # DataFrameRow
1×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1

julia> gdf = groupby(df, :a)
GroupedDataFrame with 1 group based on key: a
First Group (1 row): a = 1
 Row │ a
     │ Int64
─────┼───────
   1 │     1

julia> DataFrame(gdf) # GroupedDataFrame
1×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1

These four cases are deliberately left for convenience because there is a very
low risk that they would cause confusion. Out of these four cases a vector of
pairs is most problematic, as in Tables.jl it would get the following treatment:

julia> DataFrame(Tables.columntable([:a => 1]))
1×2 DataFrame
 Row │ first   second
     │ Symbol  Int64
─────┼────────────────
   1 │ a            1

However, we have decided that it is extremely unlikely that someone might want
to get this type of a data frame (and in case you wanted it the code above
shows you how to get it reliably).

So why have we not made an exception for matrices? The reason is that there
are many cases where AbtractMatrix actually supports Tables.jl interface
and requires a special way how it should be converted to a DataFrame.

To give some specific example have a look at this issue related
to differential equations solving (I have adapted it a bit):

~/Desktop/Dev/DF_dev$ julia --banner=no
(@v1.6) pkg> st DataFrames
      Status `~/.julia/environments/v1.6/Project.toml`
  [a93c6f00] DataFrames v0.22.7

julia> using OrdinaryDiffEq, DataFrames

julia> function parameterized_lorenz(du, u, p, t)
        du[1] = p[1] * (u[2] - u[1])
        du[2] = u[1] * (p[2] - u[3]) - u[2]
        du[3] = u[1] * u[2] - p[3] * u[3]
       end
parameterized_lorenz (generic function with 1 method)

julia> u0 = [1.0, 0.0, 0.0];

julia> tspan = (0.0, 1.0);

julia> p = [10.0, 28.0, 8/3];

julia> prob = ODEProblem(parameterized_lorenz, u0, tspan, p);

julia> sol1 = solve(prob, Rosenbrock23());

julia> DataFrame(sol1)
3×61 DataFrame
 Row │ x1       x2           x3          x4          x5          x6           ⋯
     │ Float64  Float64      Float64     Float64     Float64     Float64      ⋯
─────┼─────────────────────────────────────────────────────────────────────────
   1 │     1.0  0.999925     0.999178    0.995241    0.989338    0.977794     ⋯
   2 │     0.0  0.000209532  0.00230391  0.0134134   0.0302987   0.0641965
   3 │     0.0  7.83999e-10  9.47873e-8  3.21298e-6  1.63912e-5  7.35689e-5
                                                             55 columns omitted

julia> DataFrame(Tables.columntable(sol1))
61×4 DataFrame
 Row │ timestamp    value1     value2        value3
     │ Float64      Float64    Float64       Float64
─────┼────────────────────────────────────────────────────
   1 │ 0.0           1.0        0.0           0.0
   2 │ 7.48361e-6    0.999925   0.000209532   7.83999e-10
   3 │ 8.23197e-5    0.999178   0.00230391    9.47873e-8
   4 │ 0.000480311   0.995241   0.0134134     3.21298e-6
   5 │ 0.00108852    0.989338   0.0302987     1.63912e-5
   6 │ 0.00232154    0.977794   0.0641965     7.35689e-5
   7 │ 0.00391981    0.963652   0.107499      0.000206214
  ⋮  │      ⋮           ⋮           ⋮             ⋮
  55 │ 0.735315     -7.14414   -8.67351      25.5598
  56 │ 0.771575     -7.66582   -9.02351      25.4699
  57 │ 0.812974     -8.20321   -9.44232      25.6821
  58 │ 0.859509     -8.74036   -9.79982      26.2593
  59 │ 0.908783     -9.18149   -9.8967       27.1128
  60 │ 0.960609     -9.41649   -9.59708      28.0144
  61 │ 1.0          -9.39804   -9.12529      28.5183
                                           47 rows omitted

Now we can see that DataFrame(sol1) produces a wrong result because sol1
is an AbstractMatrix as you can check here:

julia> sol1 isa AbstractMatrix
true

Let us switch to main branch of DataFrames.jl for the remaining of this post
to test the behavior under DataFrames.jl 1.0 that will be released soon:

~/Desktop/Dev/DF_dev$ julia --banner=no
(@v1.6) pkg> activate --temp
  Activating new environment at `/tmp/jl_43Ofes/Project.toml`

(jl_43Ofes) pkg> add DataFrames#main

(jl_43Ofes) pkg> st DataFrames
      Status `/tmp/jl_43Ofes/Project.toml`
  [a93c6f00] DataFrames v0.22.7 `https://github.com/JuliaData/DataFrames.jl.git#main`

julia> using OrdinaryDiffEq, DataFrames
[ Info: Precompiling OrdinaryDiffEq [1dea7af3-3e70-54e6-95c3-0bf5283fa5ed]

julia> function parameterized_lorenz(du, u, p, t)
        du[1] = p[1] * (u[2] - u[1])
        du[2] = u[1] * (p[2] - u[3]) - u[2]
        du[3] = u[1] * u[2] - p[3] * u[3]
       end
parameterized_lorenz (generic function with 1 method)

julia> u0 = [1.0, 0.0, 0.0];

julia> tspan = (0.0, 1.0);

julia> p = [10.0, 28.0, 8/3];

julia> prob = ODEProblem(parameterized_lorenz, u0, tspan, p);

julia> sol1 = solve(prob, Rosenbrock23());

julia> DataFrame(sol1)
61×4 DataFrame
 Row │ timestamp    value1     value2        value3
     │ Float64      Float64    Float64       Float64
─────┼────────────────────────────────────────────────────
   1 │ 0.0           1.0        0.0           0.0
   2 │ 7.48361e-6    0.999925   0.000209532   7.83999e-10
   3 │ 8.23197e-5    0.999178   0.00230391    9.47873e-8
   4 │ 0.000480311   0.995241   0.0134134     3.21298e-6
   5 │ 0.00108852    0.989338   0.0302987     1.63912e-5
   6 │ 0.00232154    0.977794   0.0641965     7.35689e-5
   7 │ 0.00391981    0.963652   0.107499      0.000206214
  ⋮  │      ⋮           ⋮           ⋮             ⋮
  56 │ 0.771575     -7.66582   -9.02351      25.4699
  57 │ 0.812974     -8.20321   -9.44232      25.6821
  58 │ 0.859509     -8.74036   -9.79982      26.2593
  59 │ 0.908783     -9.18149   -9.8967       27.1128
  60 │ 0.960609     -9.41649   -9.59708      28.0144
  61 │ 1.0          -9.39804   -9.12529      28.5183
                                           48 rows omitted

And as you can see this time all worked as expected.

How to move from a matrix to a data frame the old way?

A question is can you get an old behavior easily under DataFrames.jl 1.0?
The answer is yes. It is enough to pass :auto as a second positional argument
to treat any AbstractMatrix the old way. The key point here is that :auto
adds a second argument to a constructor, which allows to disambiguate this call
and make sure we do not try a Tables.jl fallback. So continuing our last example
we have:

julia> DataFrame(sol1, :auto)
3×61 DataFrame
 Row │ x1       x2           x3          x4          x5          x6         ⋯
     │ Float64  Float64      Float64     Float64     Float64     Float64    ⋯
─────┼───────────────────────────────────────────────────────────────────────
   1 │     1.0  0.999925     0.999178    0.995241    0.989338    0.977794   ⋯
   2 │     0.0  0.000209532  0.00230391  0.0134134   0.0302987   0.0641965
   3 │     0.0  7.83999e-10  9.47873e-8  3.21298e-6  1.63912e-5  7.35689e-5
                                                           55 columns omitted

Here are some more examples (still on main):

julia> DataFrame([1 2; 3 4])
ERROR: ArgumentError: `DataFrame` constructor from a `Matrix` requires passing :auto as a second argument to automatically generate column names: `DataFrame(matrix, :auto)`

julia> DataFrame([1 2; 3 4], :auto) # auto generated column names
2×2 DataFrame
 Row │ x1     x2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2
   2 │     3      4

julia> DataFrame([1 2; 3 4], [:c1, :c2]) # passing column names explicitly
2×2 DataFrame
 Row │ c1     c2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2
   2 │     3      4

So as you can see the fix is easy.

Finally let me comment that another common similar case is a vector of vectors
being passed to a DataFrame constructor. It follows the same rules:

julia> DataFrame([1:2, 3:4])
ERROR: ArgumentError: `DataFrame` constructor from a `Vector` of vectors requires passing :auto as a second argument to automatically generate column names: `DataFrame(vecs, :auto)`

julia> DataFrame([1:2, 3:4], :auto)
2×2 DataFrame
 Row │ x1     x2
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      4

julia> DataFrame([1:2, 3:4], [:c1, :c2])
2×2 DataFrame
 Row │ c1     c2
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      4

Again – using Tables.jl default behavior would get you something unexpected
(unless you are really deep into Tables.jl mechanics 😄):

julia> DataFrame(Tables.columntable([1:2, 3:4]))
2×2 DataFrame
 Row │ start  stop
     │ Int64  Int64
─────┼──────────────
   1 │     1      2
   2 │     3      4

Conclusions

We have tried very hard to make things in DataFrames.jl 1.0 maximally consistent
with the whole Julia package ecosystem while allowing a relatively easy handling
of common data processing tasks.

Conversion from a matrix to a DataFrame is one of common hard corner cases
affected. I hope this post explains you the rationale behind the design
decisions taken in DataFrames.jl 1.0 release in this area and ways how the
DataFrame constructor should be used to give you desired results.