What is new in DataFrames.jl 1.2.0?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/07/02/dataframes-1.2.0.html

Introduction

DataFrames.jl version 1.2.0 has just been released. In this post I want
to discuss the main new user-visible features we have introduced.

The code was run under Julia 1.6.1 and DataFrames.jl 1.2.0.

New functionalities

There are three major new functionalities introduced by the 1.2.0 release. Let
me explain them one by one.

matchmissing=:notequal keyword argument in joins

Before the 1.2.0 release, missing values in on-columns in joins were either
considered to be equal (when matchmissing=:equal was passed) or produced an
error (when matchmissing=:error was passed, which is the default behavior).
Now you can also pass matchmissing=:notequal, in which case missing values are
considered as not matching. Here is a simple example comparing the three options:

julia> using DataFrames

julia> df1 = DataFrame(id=[1, missing, 3], left=1:3)
3×2 DataFrame
 Row │ id       left
     │ Int64?   Int64
─────┼────────────────
   1 │       1      1
   2 │ missing      2
   3 │       3      3

julia> df2 = DataFrame(id=[1, missing, missing], right=1:3)
3×2 DataFrame
 Row │ id       right
     │ Int64?   Int64
─────┼────────────────
   1 │       1      1
   2 │ missing      2
   3 │ missing      3

julia> innerjoin(df1, df2, on=:id)
ERROR: ArgumentError: missing values in key columns are not allowed when matchmissing == :error

julia> innerjoin(df1, df2, on=:id, matchmissing=:equal)
3×3 DataFrame
 Row │ id       left   right
     │ Int64?   Int64  Int64
─────┼───────────────────────
   1 │       1      1      1
   2 │ missing      2      2
   3 │ missing      2      3

julia> innerjoin(df1, df2, on=:id, matchmissing=:notequal)
1×3 DataFrame
 Row │ id      left   right
     │ Int64?  Int64  Int64
─────┼──────────────────────
   1 │      1      1      1
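
As a rough sketch of what the new option saves you from writing, the inner
join with matchmissing=:notequal can be compared against dropping the rows
with missing keys by hand. The variable names below are only illustrative, and
the leftjoin behavior (missing keys dropped from the right table but kept in
the left one) is how I read the release notes rather than something shown
above.

```julia
using DataFrames

df1 = DataFrame(id=[1, missing, 3], left=1:3)
df2 = DataFrame(id=[1, missing, missing], right=1:3)

# New in 1.2.0: rows with a missing key never match anything.
res_notequal = innerjoin(df1, df2, on=:id, matchmissing=:notequal)

# Manual alternative: drop rows with a missing key before joining.
# (The element type of :id in the result may differ from the one above.)
res_manual = innerjoin(dropmissing(df1, :id), dropmissing(df2, :id), on=:id)

# Both approaches should keep only the row with id == 1.
same_rows = res_notequal == res_manual

# In a left join all rows of df1 are kept, including the one with a missing
# key; it simply finds no partner in df2.
res_left = leftjoin(df1, df2, on=:id, matchmissing=:notequal)
```

Note that the manual variant works only for inner joins; for a left join,
dropping missing keys from df1 up front would lose a row that
matchmissing=:notequal keeps.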

A new syntax for column expansion in transformation functions

Users often store nested data structures in columns of a data frame.
In such cases, a frequent request is to unnest such a column.

Before the 1.2.0 release one had to perform this operation like this:

julia> df = DataFrame(col=[Dict("a"=>1, "b"=>2), Dict("a"=>3, "b"=>4)])
2×1 DataFrame
 Row │ col
     │ Dict…
─────┼──────────────────────
   1 │ Dict("b"=>2, "a"=>1)
   2 │ Dict("b"=>4, "a"=>3)

julia> transform(df, :col => identity => AsTable)
2×3 DataFrame
 Row │ col                   b      a
     │ Dict…                 Int64  Int64
─────┼────────────────────────────────────
   1 │ Dict("b"=>2, "a"=>1)      2      1
   2 │ Dict("b"=>4, "a"=>3)      4      3

Now a simpler syntax is allowed that does not require the user to write the
identity part of the transformation specification (just like in the column
renaming syntax), so the following code works

julia> transform(df, :col => AsTable)
2×3 DataFrame
 Row │ col                   b      a
     │ Dict…                 Int64  Int64
─────┼────────────────────────────────────
   1 │ Dict("b"=>2, "a"=>1)      2      1
   2 │ Dict("b"=>4, "a"=>3)      4      3

and produces the same result.
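
The same shorthand should work for other element types that can be expanded
into columns. Here is a small sketch, assuming a column of NamedTuples (my own
choice for illustration) and using select instead of transform so that only
the expanded columns are kept:

```julia
using DataFrames

# A column of NamedTuples; the field order (a, b) is fixed,
# unlike the iteration order of a Dict.
df = DataFrame(col=[(a=1, b=2), (a=3, b=4)])

# select keeps only the expanded columns; transform would also keep :col.
expanded = select(df, :col => AsTable)

# The explicit form with identity is still accepted and should
# give the same result.
expanded_old = select(df, :col => identity => AsTable)
```

With NamedTuples the resulting columns follow the field order, so here one
gets :a followed by :b, whereas with a Dict the column order depends on how
the Dict happens to iterate (b before a in the example above).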

subset! now correctly updates passed GroupedDataFrame

The subset! function was a new addition in the 1.0.0 release, so we are still
polishing it in response to user feedback.

Before 1.2.0, passing a GroupedDataFrame to subset! produced a correct
result, but could corrupt the passed GroupedDataFrame (this was noted in the
documentation; the design was chosen to improve performance). However, this
behavior was found to be error prone. Therefore, the 1.2.0 release implements
an efficient algorithm that updates not only the parent data frame but also
the GroupedDataFrame itself.

Here is an example of the current behavior:

julia> using Statistics

julia> df = DataFrame(id=repeat([1, 2], 4), x=1:8)
8×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     1      3
   4 │     2      4
   5 │     1      5
   6 │     2      6
   7 │     1      7
   8 │     2      8

julia> gd = groupby(df, :id)
GroupedDataFrame with 2 groups based on key: id
First Group (4 rows): id = 1
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      3
   3 │     1      5
   4 │     1      7
⋮
Last Group (4 rows): id = 2
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     2      2
   2 │     2      4
   3 │     2      6
   4 │     2      8

julia> subset!(gd, :x => x -> x .> mean(x)) # pick rows with :x above group mean
4×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      5
   2 │     2      6
   3 │     1      7
   4 │     2      8

julia> gd
GroupedDataFrame with 2 groups based on key: id
First Group (2 rows): id = 1
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      5
   2 │     1      7
⋮
Last Group (2 rows): id = 2
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     2      6
   2 │     2      8

In the above operation both df and gd get properly updated in place
(previously only df was changed, while gd was left unchanged and was thus
corrupted).
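
A practical consequence is that gd can be reused for further grouped
operations without calling groupby again. A minimal sketch, assuming
DataFrames.jl 1.2.0 or later (the :total column name and the helper variables
are made up for this illustration):

```julia
using DataFrames
using Statistics

df = DataFrame(id=repeat([1, 2], 4), x=1:8)
gd = groupby(df, :id)

# Keep rows with :x above the group mean; this mutates df and updates gd.
subset!(gd, :x => x -> x .> mean(x))

# gd still wraps the very same (now shorter) data frame ...
same_parent = parent(gd) === df

# ... so it can be passed directly to combine without regrouping.
# Group id = 1 keeps x = 5, 7 and group id = 2 keeps x = 6, 8.
totals = combine(gd, :x => sum => :total)
```

On versions before 1.2.0 one should instead call groupby again after subset!,
since the old GroupedDataFrame was not safe to use afterwards.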

Deprecated functionality

Early in the development of DataFrames.jl, the design of DataFrame was very
close to that of a matrix. Over the years a consensus was reached that we
should rather treat it as a Tables.jl table. However, the legacy thinking was
still reflected in the design of the hcat function, which allowed horizontal
concatenation of a data frame with a vector, just like it is allowed for
matrices. Unfortunately, this approach conflicts with the fact that many
vector types now support the Tables.jl table interface, and when doing hcat
users would prefer them to be treated as tables.

I think the issue is most easily explained with an example. The Julia session
shown below was started with the --depwarn=yes flag:

julia> using DataFrames

julia> df = DataFrame(col1='a':'c')
3×1 DataFrame
 Row │ col1
     │ Char
─────┼──────
   1 │ a
   2 │ b
   3 │ c

julia> hcat(df, [(x=i, y=10+i) for i in 1:3])
┌ Warning: horizontal concatenation of data frame with a vector is deprecated. Pass DataFrame(x1=x) instead.
│   caller = ip:0x0
└ @ Core :-1
3×2 DataFrame
 Row │ col1  x1
     │ Char  NamedTup…
─────┼───────────────────────
   1 │ a     (x = 1, y = 11)
   2 │ b     (x = 2, y = 12)
   3 │ c     (x = 3, y = 13)

As you can see, a vector of NamedTuples, although it is a Tables.jl table,
is just horizontally concatenated to df as a new column with the automatically
generated name :x1.

However, most likely the user expected the following result (but without having
to use the DataFrame constructor):

julia> hcat(df, DataFrame([(x=i, y=10+i) for i in 1:3]))
3×3 DataFrame
 Row │ col1  x      y
     │ Char  Int64  Int64
─────┼────────────────────
   1 │ a         1     11
   2 │ b         2     12
   3 │ c         3     13

In order to allow this behavior in the future, passing a vector to hcat when
the other argument is a data frame is currently deprecated, as you can see above.
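
Until then, both intentions can be spelled out explicitly, which avoids the
deprecation warning. The sketch below reuses the data from the example above;
the variable names as_column and as_table are only illustrative:

```julia
using DataFrames

df = DataFrame(col1='a':'c')
v = [(x=i, y=10+i) for i in 1:3]

# Keep v as a single column: wrap it in a DataFrame with an explicit
# column name, as the deprecation warning suggests.
as_column = hcat(df, DataFrame(x1=v))

# Treat v as a table: let the DataFrame constructor expand it into
# the columns :x and :y before concatenating.
as_table = hcat(df, DataFrame(v))
```

Both calls pass only data frames to hcat, so they should keep working
regardless of how a bare vector argument is interpreted in future releases.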

Conclusions

I hope you will enjoy the new features we have shipped in the 1.2.0 release of
DataFrames.jl.

Apart from the changes discussed above, several minor ones have been made,
mostly in the areas of performance, display, and documentation. You can find a
more detailed list of what changed in the 1.2.0 release notes.

Also remember that the NEWS.md file in the project repository is maintained
to give a concise summary of the most important changes introduced in
the releases of DataFrames.jl.