Re-posted from: https://bkamins.github.io/julialang/2023/03/24/df15.html
Introduction
The objective of this post is simple: I want to discuss
the most important new functionalities that DataFrames.jl 1.5 offers.
The changes I cover today are:
- more powerful
Cols
selector; - group order setting in
groupby
; - setting row order in joins;
- new options of handling duplicate rows in
unique
; - making
flatten
function scalar-aware.
The post was tested under Julia 1.9.0-rc1 and DataFrames.jl 1.5.0.
More powerful Cols
selector
The Cols
selector has gotten two new features:
- ability to mix condition function selector with other selectors;
- new
operator
keyword argument.
Let me explain both by example. Start with using condition function selector.
This is the old behavior that was supported:
julia> using DataFrames
julia> df = DataFrame(x1=1, x2=2, y1=3, y2=4)
1×4 DataFrame
Row │ x1 x2 y1 y2
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 2 3 4
julia> select(df, Cols(startswith("x")))
1×2 DataFrame
Row │ x1 x2
│ Int64 Int64
─────┼──────────────
1 │ 1 2
The startswith("x")
condition function was required to be the only argument to Cols
.
Now you can flexibly mix it with other selectors:
julia> select(df, Cols(startswith("x"), r"2"))
1×3 DataFrame
Row │ x1 x2 y2
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 4
julia> select(df, Cols(startswith("x"), endswith("2")))
1×3 DataFrame
Row │ x1 x2 y2
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 4
julia> select(df, Cols(startswith("x"), :y2))
1×3 DataFrame
Row │ x1 x2 y2
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 4
The second change is operator
keyword argument. By default Cols
, when passed multiple arguments
takes their union:
julia> select(df, Cols(startswith("x"), endswith("2")))
1×3 DataFrame
Row │ x1 x2 y2
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 4
However, you can pass other operators that specify the way how columns selected by individual
arguments should be combined together. For example you can take their intersection:
julia> select(df, Cols(startswith("x"), endswith("2"), operator=intersect))
1×1 DataFrame
Row │ x2
│ Int64
─────┼───────
1 │ 2
or set difference:
julia> select(df, Cols(startswith("x"), endswith("2"), operator=setdiff))
1×1 DataFrame
Row │ x1
│ Int64
─────┼───────
1 │ 1
Group order setting in groupby
Previously grouby
when you passed sort=true
sorted groups in ascending order.
Here is an example:
julia> df = DataFrame(id=["a", "c", "b"], row=1:3)
3×2 DataFrame
Row │ id row
│ String Int64
─────┼───────────────
1 │ a 1
2 │ c 2
3 │ b 3
julia> show(groupby(df, :id, sort=true), allgroups=true)
GroupedDataFrame with 3 groups based on key: id
Group 1 (1 row): id = "a"
Row │ id row
│ String Int64
─────┼───────────────
1 │ a 1
Group 2 (1 row): id = "b"
Row │ id row
│ String Int64
─────┼───────────────
1 │ b 3
Group 3 (1 row): id = "c"
Row │ id row
│ String Int64
─────┼───────────────
1 │ c 2
Now you can use pass in sort
any set of keyword arguments that sort
accepts as a named tuple.
For example if you wanted groups to be sorted in reverse order do:
julia> show(groupby(df, :id, sort=(rev=true,)), allgroups=true)
GroupedDataFrame with 3 groups based on key: id
Group 1 (1 row): id = "c"
Row │ id row
│ String Int64
─────┼───────────────
1 │ c 2
Group 2 (1 row): id = "b"
Row │ id row
│ String Int64
─────┼───────────────
1 │ b 3
Group 3 (1 row): id = "a"
Row │ id row
│ String Int64
─────┼───────────────
1 │ a 1
Setting row order in joins
By default join operations do not guarantee the order of rows in the output
(just like databases):
julia> df_left = DataFrame(id=[1, 2, 4, 5], left=1:4)
4×2 DataFrame
Row │ id left
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 4 3
4 │ 5 4
julia> df_right = DataFrame(id=[2, 1, 3, 6, 7], right=1:5)
5×2 DataFrame
Row │ id right
│ Int64 Int64
─────┼──────────────
1 │ 2 1
2 │ 1 2
3 │ 3 3
4 │ 6 4
5 │ 7 5
julia> outerjoin(df_left, df_right, on=:id)
7×3 DataFrame
Row │ id left right
│ Int64 Int64? Int64?
─────┼─────────────────────────
1 │ 2 2 1
2 │ 1 1 2
3 │ 4 3 missing
4 │ 5 4 missing
5 │ 3 missing 3
6 │ 6 missing 4
7 │ 7 missing 5
However, often users want the result to follow the order of rows of one of the source tables.
This can be achieved now using the order
keyword argument.
If you want the result to have rows in the order of left table (and then added
non-matching rows from the right table at the end, if needed) do:
julia> outerjoin(df_left, df_right, on=:id, order=:left)
7×3 DataFrame
Row │ id left right
│ Int64 Int64? Int64?
─────┼─────────────────────────
1 │ 1 1 2
2 │ 2 2 1
3 │ 4 3 missing
4 │ 5 4 missing
5 │ 3 missing 3
6 │ 6 missing 4
7 │ 7 missing 5
Similar option is available if you want to keep the row order of the right table:
julia> outerjoin(df_left, df_right, on=:id, order=:right)
7×3 DataFrame
Row │ id left right
│ Int64 Int64? Int64?
─────┼─────────────────────────
1 │ 2 2 1
2 │ 1 1 2
3 │ 3 missing 3
4 │ 6 missing 4
5 │ 7 missing 5
6 │ 4 3 missing
7 │ 5 4 missing
New options of handling duplicate rows in unique
By default unique
(and related functions unique!
and nonunique
) kept
the first duplicate row in case of duplicates. Here is an example:
julia> df = DataFrame(a=[1, 2, 3, 1, 2, 4], id=1:6)
6×2 DataFrame
Row │ a id
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
4 │ 1 4
5 │ 2 5
6 │ 4 6
julia> unique(df, :a)
4×2 DataFrame
Row │ a id
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
4 │ 4 6
However, users sometimes wanted a different behavior. Two more are now supported.
If instead you want to keep the last of duplicate rows pass keep=:last
:
julia> unique(df, :a, keep=:last)
4×2 DataFrame
Row │ a id
│ Int64 Int64
─────┼──────────────
1 │ 3 3
2 │ 1 4
3 │ 2 5
4 │ 4 6
While, if you do not want to keep any duplicate rows pass keep=:noduplicates
:
julia> unique(df, :a, keep=:noduplicates)
2×2 DataFrame
Row │ a id
│ Int64 Int64
─────┼──────────────
1 │ 3 3
2 │ 4 6
Making flatten
function scalar-aware
The flatten
function is often useful when one wants to unnest a column that
holds collections (e.g. vectors). Here is an example:
julia> df = DataFrame(id=1:3, col=[["a", "b"], ["c", "d"], ["e", "f"]])
3×2 DataFrame
Row │ id col
│ Int64 Array…
─────┼───────────────────
1 │ 1 ["a", "b"]
2 │ 2 ["c", "d"]
3 │ 3 ["e", "f"]
julia> flatten(df, :col)
6×2 DataFrame
Row │ id col
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 b
3 │ 2 c
4 │ 2 d
5 │ 3 e
6 │ 3 f
However, sometimes such a column might hold values that are not collections and we do not want to try expanding them.
The most common case is missing
:
julia> df = DataFrame(id=1:3, col=[["a", "b"], missing, ["e", "f"]])
3×2 DataFrame
Row │ id col
│ Int64 Array…?
─────┼───────────────────
1 │ 1 ["a", "b"]
2 │ 2 missing
3 │ 3 ["e", "f"]
julia> flatten(df, :col)
ERROR: MethodError: no method matching length(::Missing)
As you can see the operation failed as missing
is not a collection that has length (thus it cannot be expanded).
Now you can specify that missing
is a scalar that, when encountered, should not be expanded.
You do it by passing scalar=Missing
keyword argument:
julia> flatten(df, :col, scalar=Missing)
5×2 DataFrame
Row │ id col
│ Int64 String?
─────┼────────────────
1 │ 1 a
2 │ 1 b
3 │ 2 missing
4 │ 3 e
5 │ 3 f
Conclusions
I hope you will find the new additions that were introduced in DataFrames.jl useful for your data wrangling tasks!