What is new in DataFrames.jl 1.5

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/03/24/df15.html

Introduction

The objective of this post is simple: I want to discuss
the most important new functionalities that DataFrames.jl 1.5 offers.

The changes I cover today are:

  • more powerful Cols selector;
  • group order setting in groupby;
  • setting row order in joins;
  • new options of handling duplicate rows in unique;
  • making flatten function scalar-aware.

The post was tested under Julia 1.9.0-rc1 and DataFrames.jl 1.5.0.

More powerful Cols selector

The Cols selector has gotten two new features:

  • ability to mix condition function selector with other selectors;
  • new operator keyword argument.

Let me explain both by example. Start with using condition function selector.

This is the old behavior that was supported:

julia> using DataFrames

julia> df = DataFrame(x1=1, x2=2, y1=3, y2=4)
1×4 DataFrame
 Row │ x1     x2     y1     y2
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      2      3      4

julia> select(df, Cols(startswith("x")))
1×2 DataFrame
 Row │ x1     x2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

The startswith("x") condition function was required to be the only argument to Cols.
Now you can flexibly mix it with other selectors:

julia> select(df, Cols(startswith("x"), r"2"))
1×3 DataFrame
 Row │ x1     x2     y2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      4

julia> select(df, Cols(startswith("x"), endswith("2")))
1×3 DataFrame
 Row │ x1     x2     y2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      4

julia> select(df, Cols(startswith("x"), :y2))
1×3 DataFrame
 Row │ x1     x2     y2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      4

The second change is operator keyword argument. By default Cols, when passed multiple arguments
takes their union:

julia> select(df, Cols(startswith("x"), endswith("2")))
1×3 DataFrame
 Row │ x1     x2     y2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      4

However, you can pass other operators that specify the way how columns selected by individual
arguments should be combined together. For example you can take their intersection:

julia> select(df, Cols(startswith("x"), endswith("2"), operator=intersect))
1×1 DataFrame
 Row │ x2
     │ Int64
─────┼───────
   1 │     2

or set difference:

julia> select(df, Cols(startswith("x"), endswith("2"), operator=setdiff))
1×1 DataFrame
 Row │ x1
     │ Int64
─────┼───────
   1 │     1

Group order setting in groupby

Previously grouby when you passed sort=true sorted groups in ascending order.
Here is an example:

julia> df = DataFrame(id=["a", "c", "b"], row=1:3)
3×2 DataFrame
 Row │ id      row
     │ String  Int64
─────┼───────────────
   1 │ a           1
   2 │ c           2
   3 │ b           3

julia> show(groupby(df, :id, sort=true), allgroups=true)
GroupedDataFrame with 3 groups based on key: id
Group 1 (1 row): id = "a"
 Row │ id      row
     │ String  Int64
─────┼───────────────
   1 │ a           1
Group 2 (1 row): id = "b"
 Row │ id      row
     │ String  Int64
─────┼───────────────
   1 │ b           3
Group 3 (1 row): id = "c"
 Row │ id      row
     │ String  Int64
─────┼───────────────
   1 │ c           2

Now you can use pass in sort any set of keyword arguments that sort accepts as a named tuple.
For example if you wanted groups to be sorted in reverse order do:

julia> show(groupby(df, :id, sort=(rev=true,)), allgroups=true)
GroupedDataFrame with 3 groups based on key: id
Group 1 (1 row): id = "c"
 Row │ id      row
     │ String  Int64
─────┼───────────────
   1 │ c           2
Group 2 (1 row): id = "b"
 Row │ id      row
     │ String  Int64
─────┼───────────────
   1 │ b           3
Group 3 (1 row): id = "a"
 Row │ id      row
     │ String  Int64
─────┼───────────────
   1 │ a           1

Setting row order in joins

By default join operations do not guarantee the order of rows in the output
(just like databases):

julia> df_left = DataFrame(id=[1, 2, 4, 5], left=1:4)
4×2 DataFrame
 Row │ id     left
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     4      3
   4 │     5      4

julia> df_right = DataFrame(id=[2, 1, 3, 6, 7], right=1:5)
5×2 DataFrame
 Row │ id     right
     │ Int64  Int64
─────┼──────────────
   1 │     2      1
   2 │     1      2
   3 │     3      3
   4 │     6      4
   5 │     7      5

julia> outerjoin(df_left, df_right, on=:id)
7×3 DataFrame
 Row │ id     left     right
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     2        2        1
   2 │     1        1        2
   3 │     4        3  missing
   4 │     5        4  missing
   5 │     3  missing        3
   6 │     6  missing        4
   7 │     7  missing        5

However, often users want the result to follow the order of rows of one of the source tables.
This can be achieved now using the order keyword argument.

If you want the result to have rows in the order of left table (and then added
non-matching rows from the right table at the end, if needed) do:

julia> outerjoin(df_left, df_right, on=:id, order=:left)
7×3 DataFrame
 Row │ id     left     right
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     1        1        2
   2 │     2        2        1
   3 │     4        3  missing
   4 │     5        4  missing
   5 │     3  missing        3
   6 │     6  missing        4
   7 │     7  missing        5

Similar option is available if you want to keep the row order of the right table:

julia> outerjoin(df_left, df_right, on=:id, order=:right)
7×3 DataFrame
 Row │ id     left     right
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     2        2        1
   2 │     1        1        2
   3 │     3  missing        3
   4 │     6  missing        4
   5 │     7  missing        5
   6 │     4        3  missing
   7 │     5        4  missing

New options of handling duplicate rows in unique

By default unique (and related functions unique! and nonunique) kept
the first duplicate row in case of duplicates. Here is an example:

julia> df = DataFrame(a=[1, 2, 3, 1, 2, 4], id=1:6)
6×2 DataFrame
 Row │ a      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3
   4 │     1      4
   5 │     2      5
   6 │     4      6

julia> unique(df, :a)
4×2 DataFrame
 Row │ a      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3
   4 │     4      6

However, users sometimes wanted a different behavior. Two more are now supported.
If instead you want to keep the last of duplicate rows pass keep=:last:

julia> unique(df, :a, keep=:last)
4×2 DataFrame
 Row │ a      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     1      4
   3 │     2      5
   4 │     4      6

While, if you do not want to keep any duplicate rows pass keep=:noduplicates:

julia> unique(df, :a, keep=:noduplicates)
2×2 DataFrame
 Row │ a      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     4      6

Making flatten function scalar-aware

The flatten function is often useful when one wants to unnest a column that
holds collections (e.g. vectors). Here is an example:

julia> df = DataFrame(id=1:3, col=[["a", "b"], ["c", "d"], ["e", "f"]])
3×2 DataFrame
 Row │ id     col
     │ Int64  Array…
─────┼───────────────────
   1 │     1  ["a", "b"]
   2 │     2  ["c", "d"]
   3 │     3  ["e", "f"]

julia> flatten(df, :col)
6×2 DataFrame
 Row │ id     col
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  b
   3 │     2  c
   4 │     2  d
   5 │     3  e
   6 │     3  f

However, sometimes such a column might hold values that are not collections and we do not want to try expanding them.
The most common case is missing:

julia> df = DataFrame(id=1:3, col=[["a", "b"], missing, ["e", "f"]])
3×2 DataFrame
 Row │ id     col
     │ Int64  Array…?
─────┼───────────────────
   1 │     1  ["a", "b"]
   2 │     2  missing
   3 │     3  ["e", "f"]

julia> flatten(df, :col)
ERROR: MethodError: no method matching length(::Missing)

As you can see the operation failed as missing is not a collection that has length (thus it cannot be expanded).

Now you can specify that missing is a scalar that, when encountered, should not be expanded.
You do it by passing scalar=Missing keyword argument:

julia> flatten(df, :col, scalar=Missing)
5×2 DataFrame
 Row │ id     col
     │ Int64  String?
─────┼────────────────
   1 │     1  a
   2 │     1  b
   3 │     2  missing
   4 │     3  e
   5 │     3  f

Conclusions

I hope you will find the new additions that were introduced in DataFrames.jl useful for your data wrangling tasks!