Re-posted from: https://bkamins.github.io/julialang/2022/02/18/upcoming.html
Introduction
In this post I want to introduce two features of DataFrames.jl that will be
added in 1.4 release that are already available on main
branch. The new
additions are:
- extensions of the transformation mini-language;
- pivot table functionality of the
unstack
function.
The post was written under Julia 1.7.0 and DataFrames.jl current main
branch,
and DataFramesMeta.jl 0.10.
Environment setup
Since we want to work on a main
branch we set it up first:
(@v1.7) pkg> add DataFrames#main
Updating git-repo `https://github.com/JuliaData/DataFrames.jl.git`
Updating registry at `~/.julia/registries/General`
Updating git-repo `https://github.com/JuliaRegistries/General.git`
Resolving package versions...
Updating `~/.julia/environments/v1.7/Project.toml`
[a93c6f00] ~ DataFrames v1.3.2 ⇒ v1.4.0 `https://github.com/JuliaData/DataFrames.jl.git#main`
Updating `~/.julia/environments/v1.7/Manifest.toml`
[a93c6f00] ~ DataFrames v1.3.2 ⇒ v1.4.0 `https://github.com/JuliaData/DataFrames.jl.git#main`
Precompiling project...
2 dependencies successfully precompiled in 20 seconds (47 already precompiled)
julia> using DataFrames
Extensions of the transformation mini-language
In DataFrames.jl 1.3 the nrow
function has a special syntax as it does not
require passing source column (but optionally allows specifying output column
name) and returns the number of rows in a data frame or grouped data frame.
Here is an example:
julia> df = DataFrame(year=repeat([2020, 2021]; outer=3),
region=repeat(["north", "center", "south"]; inner=2),
sales=[500, 600, 700, 550, 630, 680])
6×3 DataFrame
Row │ year region sales
│ Int64 String Int64
─────┼──────────────────────
1 │ 2020 north 500
2 │ 2021 north 600
3 │ 2020 center 700
4 │ 2021 center 550
5 │ 2020 south 630
6 │ 2021 south 680
julia> gdf = groupby(df, :year)
GroupedDataFrame with 2 groups based on key: year
First Group (3 rows): year = 2020
Row │ year region sales
│ Int64 String Int64
─────┼──────────────────────
1 │ 2020 north 500
2 │ 2020 center 700
3 │ 2020 south 630
⋮
Last Group (3 rows): year = 2021
Row │ year region sales
│ Int64 String Int64
─────┼──────────────────────
1 │ 2021 north 600
2 │ 2021 center 550
3 │ 2021 south 680
julia> combine(df, nrow)
1×1 DataFrame
Row │ nrow
│ Int64
─────┼───────
1 │ 6
julia> combine(gdf, nrow => :region_count)
2×2 DataFrame
Row │ year region_count
│ Int64 Int64
─────┼─────────────────────
1 │ 2020 3
2 │ 2021 3
In DataFrames.jl 1.4 three more functions having the same special handling are
going to be added. They are:
eachindex
: returning row number in a data frame or in group of grouped data
frame;proprow
: returning proportion of rows in group
(only supported for grouped data frame);groupindices
: returning group number (only supported for grouped data frame);
As you can see none of these functions require passing a concrete source column
to know how to produce their result and this is the reason why they follow the
non-standard syntax.
Let me show you how they work by example. Start with eachindex
:
julia> transform(df, eachindex)
6×4 DataFrame
Row │ year region sales eachindex
│ Int64 String Int64 Int64
─────┼─────────────────────────────────
1 │ 2020 north 500 1
2 │ 2021 north 600 2
3 │ 2020 center 700 3
4 │ 2021 center 550 4
5 │ 2020 south 630 5
6 │ 2021 south 680 6
julia> transform(gdf, eachindex => :row_number)
6×4 DataFrame
Row │ year region sales row_number
│ Int64 String Int64 Int64
─────┼──────────────────────────────────
1 │ 2020 north 500 1
2 │ 2021 north 600 1
3 │ 2020 center 700 2
4 │ 2021 center 550 2
5 │ 2020 south 630 3
6 │ 2021 south 680 3
Note, that in the second example the returned row indices are within-group.
Now let us see how proprow
and groupindices
work. As I have noted above
they are only supported for GroupedDataFrame
:
julia> combine(gdf, groupindices => :group_number, proprow)
2×3 DataFrame
Row │ year group_number proprow
│ Int64 Int64 Float64
─────┼──────────────────────────────
1 │ 2020 1 0.5
2 │ 2021 2 0.5
Let us use the proprow
function to check the distribution of sum of two
tosses using standard six-sided dice additionally using DataFramesMeta.jl:
julia> using Random
julia> using DataFramesMeta
julia> Random.seed!(1234);
julia> @chain DataFrame(res=rand(1:6, 10^8) + rand(1:6, 10^8)) begin
groupby(:res)
@combine(:empirical = $proprow)
@rtransform(:theoretical = (6 - abs(7 - :res)) / 36)
end
11×3 DataFrame
Row │ res empirical theoretical
│ Int64 Float64 Float64
─────┼───────────────────────────────
1 │ 2 0.0277863 0.0277778
2 │ 3 0.0555783 0.0555556
3 │ 4 0.0832717 0.0833333
4 │ 5 0.111112 0.111111
5 │ 6 0.138934 0.138889
6 │ 7 0.166655 0.166667
7 │ 8 0.138885 0.138889
8 │ 9 0.111094 0.111111
9 │ 10 0.0833185 0.0833333
10 │ 11 0.0555788 0.0555556
11 │ 12 0.0277863 0.0277778
Pivot table functionality of the unstack
function
The second upcoming functionality is addition of valuestransform
keyword
argument to unstack
. It allows you to create pivot tables. Start with
generating some table we might want to aggregate:
julia> Random.seed!(1234);
julia> sales_df = DataFrame(year=rand(2020:2021, 100),
region=rand(["north", "center", "south"], 100),
sales=rand(100))
100×3 DataFrame
Row │ year region sales
│ Int64 String Float64
─────┼──────────────────────────
1 │ 2020 center 0.558861
2 │ 2021 center 0.122447
3 │ 2020 north 0.20889
⋮ │ ⋮ ⋮ ⋮
98 │ 2020 center 0.983299
99 │ 2020 north 0.882375
100 │ 2021 south 0.232082
94 rows omitted
julia> transform!(groupby(sales_df, [:year, :region]), eachindex => :store_id)
100×4 DataFrame
Row │ year region sales store_id
│ Int64 String Float64 Int64
─────┼────────────────────────────────────
1 │ 2020 center 0.558861 1
2 │ 2021 center 0.122447 1
3 │ 2020 north 0.20889 1
⋮ │ ⋮ ⋮ ⋮ ⋮
98 │ 2020 center 0.983299 13
99 │ 2020 north 0.882375 12
100 │ 2021 south 0.232082 19
94 rows omitted
I have generated the :store_id
column that gives us store identifier per year
and region as an exercise of eachindex
usage.
What we want to get is a pivot table showing us total sales by year and region.
Traditionally you would do it like this:
julia> combine(groupby(sales_df, [:year, :region]), :sales => sum => :total_sales)
6×3 DataFrame
Row │ year region total_sales
│ Int64 String Float64
─────┼────────────────────────────
1 │ 2020 center 7.25926
2 │ 2021 center 6.00744
3 │ 2020 north 7.10255
4 │ 2021 south 7.53942
5 │ 2020 south 9.35856
6 │ 2021 north 10.7248
and later use unstack
on the result like this:
julia> @chain sales_df begin
groupby([:year, :region])
combine(:sales => sum => :total_sales)
unstack(:region, :year, :total_sales)
end
3×3 DataFrame
Row │ region 2020 2021
│ String Float64? Float64?
─────┼────────────────────────────
1 │ center 7.25926 6.00744
2 │ north 7.10255 10.7248
3 │ south 9.35856 7.53942
Now you can do it in one step with valuestransform
keyword argument to
unstack
which allows you to pass aggregation function you want to apply
to the unstacked data:
julia> unstack(sales_df, :region, :year, :sales; valuestransform=sum)
3×3 DataFrame
Row │ region 2020 2021
│ String Float64? Float64?
─────┼────────────────────────────
1 │ center 7.25926 6.00744
2 │ north 7.10255 10.7248
3 │ south 9.35856 7.53942
Let us now perform a bit more complex transformation:
julia> @chain sales_df begin
@rtransform(:region_year = string(:region, "_", :year))
unstack(:store_id, :region_year, :sales; valuestransform=sum, fill=0.0)
end
20×7 DataFrame
Row │ store_id center_2020 center_2021 north_2020 south_2021 south_2020 north_2021
│ Int64 Float64 Float64 Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────────────────────────────────────────
1 │ 1 0.558861 0.122447 0.20889 0.143436 0.329521 0.866798
2 │ 2 0.0380815 0.309235 0.034361 0.552426 0.310385 0.959957
3 │ 3 0.516981 0.261204 0.95434 0.413821 0.104218 0.199257
4 │ 4 0.0618067 0.075524 0.099118 0.706231 0.217979 0.370522
5 │ 5 0.0507147 0.275146 0.328642 0.975947 0.677177 0.00706048
6 │ 6 0.732012 0.0207132 0.991149 0.620299 0.814166 0.531297
7 │ 7 0.204139 0.931495 0.832146 0.201035 0.554558 0.759341
8 │ 8 0.799132 0.194907 0.731654 0.234099 0.67543 0.194442
9 │ 9 0.831656 0.894577 0.694614 0.0841972 0.339449 0.164982
10 │ 10 0.979482 0.057445 0.826054 0.177784 0.858287 0.529903
11 │ 11 0.989703 0.568018 0.519209 0.185454 0.0682582 0.315405
12 │ 12 0.513391 0.573917 0.882375 0.294397 0.465337 0.426687
13 │ 13 0.983299 0.0552395 0.0 0.145345 0.55148 0.630227
14 │ 14 0.0 0.6342 0.0 0.627799 0.0697159 0.83729
15 │ 15 0.0 0.435403 0.0 0.743466 0.756869 0.940755
16 │ 16 0.0 0.102038 0.0 0.641528 0.867504 0.978431
17 │ 17 0.0 0.495932 0.0 0.236 0.436487 0.937941
18 │ 18 0.0 0.0 0.0 0.324075 0.955728 0.241034
19 │ 19 0.0 0.0 0.0 0.232082 0.248491 0.833433
20 │ 20 0.0 0.0 0.0 0.0 0.0575235 0.0
Here, I have shown you the fill
keyword argument that is used to provide
value that should be used in cases we do not have any data for some combination
of row and column keys (by default such cells are filled with missing
).
Conclusions
I hope you will find these upcoming functionalities useful. If you have any
questions or comments regarding them please open an issue on DataFrames.jl
repository.