By: Blog by Bogumił Kamiński
Re-posted from: https://bkamins.github.io/julialang/2024/03/01/gdf.html
Introduction
One of the features of DataFrames.jl that I often find useful is that when you group
a data frame by some of its columns the resulting GroupedDataFrame
is an object
that gains new and useful functionalities.
Some time ago I have discussed how GroupedDataFrame
can be filtered. You can find
this post here. In this post and the following one that I plan to write next
week I thought that it would be useful to review other key functionalities of
a GroupedDataFrame
.
The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.
Creating a grouped data frame
You can create a GroupedDataFrame
using the groupby
function.
Here are some examples:
julia> using DataFrames
julia> df = DataFrame(int=[1, 3, 2, 1, 3, 2],
str=["a", "a", "c", "c", "b", "b"])
6×2 DataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
3 │ 2 c
4 │ 1 c
5 │ 3 b
6 │ 2 b
julia> show(groupby(df, :int), allgroups=true)
GroupedDataFrame with 3 groups based on key: int
Group 1 (2 rows): int = 1
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
Group 2 (2 rows): int = 2
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
Group 3 (2 rows): int = 3
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 a
2 │ 3 b
julia> show(groupby(df, :int; sort=true), allgroups=true)
GroupedDataFrame with 3 groups based on key: int
Group 1 (2 rows): int = 1
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
Group 2 (2 rows): int = 2
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
Group 3 (2 rows): int = 3
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 a
2 │ 3 b
julia> show(groupby(df, :int; sort=false), allgroups=true)
GroupedDataFrame with 3 groups based on key: int
Group 1 (2 rows): int = 1
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
Group 2 (2 rows): int = 3
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 a
2 │ 3 b
Group 3 (2 rows): int = 2
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
julia> show(groupby(df, :str), allgroups=true)
GroupedDataFrame with 3 groups based on key: str
Group 1 (2 rows): str = "a"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
Group 2 (2 rows): str = "c"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 1 c
Group 3 (2 rows): str = "b"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 b
julia> show(groupby(df, :str; sort=true), allgroups=true)
GroupedDataFrame with 3 groups based on key: str
Group 1 (2 rows): str = "a"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
Group 2 (2 rows): str = "b"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 b
Group 3 (2 rows): str = "c"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 1 c
julia> show(groupby(df, :str; sort=false), allgroups=true)
GroupedDataFrame with 3 groups based on key: str
Group 1 (2 rows): str = "a"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
Group 2 (2 rows): str = "c"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 1 c
Group 3 (2 rows): str = "b"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 b
What this example shows is that the key thing you need to remember
to decide about a grouped data frame is the order of groups.
There are two options here:
- groups sorted by the grouping column value, when you pass
sort=true
;
- groups sorted by the order of appearance of values in the source, when you pass
sort=true
.
You might ask what happens if you do not pass the sort
keyword argument?
In this case either of the options is used depending on which one is faster.
Therefore, omitting sort
, can be thought of as an information that the user does not
care about the order of groups but wants the grouping operation to be as fast as possible.
When does the order of groups not matter?
In some cases the order of groups is irrelevant (so you can safely skip passing it).
The most important scenario of this kind is when you use the select
or transform
function
with a GroupedDataFrame
. The reason is that these functions anyway always keep the order of
rows from the source data frame (no matter how the groups are rearranged in a GroupedDataFrame
).
However, it is not the case with combine
, as it respects the order of groups in a GroupedDataFrame
.
Let us see an example highlighting the difference between these cases:
julia> select(groupby(df, :int, sort=true), :str)
6×2 DataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
3 │ 2 c
4 │ 1 c
5 │ 3 b
6 │ 2 b
julia> combine(groupby(df, :int, sort=true), :str)
6×2 DataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
3 │ 2 c
4 │ 2 b
5 │ 3 a
6 │ 3 b
julia> select(groupby(df, :int, sort=false), :str)
6×2 DataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
3 │ 2 c
4 │ 1 c
5 │ 3 b
6 │ 2 b
julia> combine(groupby(df, :int, sort=false), :str)
6×2 DataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
3 │ 3 a
4 │ 3 b
5 │ 2 c
6 │ 2 b
As you can see select
kept the rows in the order in which they are present in df
no matter if we
passed sort=true
or sort=false
. On the other hand combine
returns rows grouped by the groups and
the order of groups corresponds to their order in GroupedDataFrame
, so passing sort=true
or
sort=false
in general changes.
Special operation specification syntax for working with grouped data frames
When discussing select
or combine
in conjunction with GroupedDataFrame
it is important to mention
that there are four special cases of operation specification syntax designed specifically for working with
them. They are:
nrow
to compute the number of rows in each group.
proprow
to compute the proportion of rows in each group.
eachindex
to return a vector holding the number of each row within each group.
groupindices
to return the group number.
Each of them optionally allows you to specify the name of the target column by =>
syntax.
Here are some examples:
julia> combine(groupby(df, :int, sort=false), nrow)
3×2 DataFrame
Row │ int nrow
│ Int64 Int64
─────┼──────────────
1 │ 1 2
2 │ 3 2
3 │ 2 2
julia> combine(groupby(df, :int, sort=false), proprow => "row %")
3×2 DataFrame
Row │ int row %
│ Int64 Float64
─────┼─────────────────
1 │ 1 0.333333
2 │ 3 0.333333
3 │ 2 0.333333
julia> combine(groupby(df, :int, sort=false), eachindex)
6×2 DataFrame
Row │ int eachindex
│ Int64 Int64
─────┼──────────────────
1 │ 1 1
2 │ 1 2
3 │ 3 1
4 │ 3 2
5 │ 2 1
6 │ 2 2
julia> combine(groupby(df, :int, sort=false), groupindices => "group #")
3×2 DataFrame
Row │ int group #
│ Int64 Int64
─────┼────────────────
1 │ 1 1
2 │ 3 2
3 │ 2 3
Iterating a grouped data frame
Apart from using functions such as select
or combine
on a GroupedDataFrame
it is useful to know
that it supports iteration. Therefore you can use a GroupedDataFrame
in a loop or in a comprehension.
When iterated GroupedDataFrame
returns data frames corresponding to the groups. Let us see:
julia> for v in groupby(df, :int, sort=false)
println(v)
end
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 a
2 │ 3 b
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
julia> [v for v in groupby(df, :int, sort=false)]
3-element Vector{SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}}}:
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 a
2 │ 3 b
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
julia> collect(groupby(df, :int, sort=false))
3-element Vector{Any}:
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 a
2 │ 3 b
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
The last example has shown you that you can pass a GroupedDataFrame
to a function expecting an iterable, in this case the collect
function. The one exception to this rule is that you cannot use GroupedDataFrame
with the map
function directly:
julia> map(identity, groupby(df, :int, sort=false))
ERROR: ArgumentError: using map over `GroupedDataFrame`s is reserved
The reason is that it was not clear if such operation should produce a vector or a data frame, and it is easy enough to achieve both results with other means. If you want e vector use e.g. a comprehension. If you want a data frame use e.g. combine
or select
.
Advanced iteration
Sometimes, when iterating a GroupedDataFrame
we might be interested not only in a data frame per group, but also in a value of grouping variable. This is easily achieved with the keys
and pairs
functions (depending on whether you only want grouping values or both grouping values and data frames):
julia> map(identity, keys(groupby(df, :int, sort=false)))
3-element Vector{DataFrames.GroupKey{GroupedDataFrame{DataFrame}}}:
GroupKey: (int = 1,)
GroupKey: (int = 3,)
GroupKey: (int = 2,)
julia> map(identity, pairs(groupby(df, :int, sort=false)))
3-element Vector{Pair{DataFrames.GroupKey{GroupedDataFrame{DataFrame}}, SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}}}}:
GroupKey: (int = 1,) => 2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
GroupKey: (int = 3,) => 2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 a
2 │ 3 b
GroupKey: (int = 2,) => 2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
I used the map
function to show you that it is only reserved to use it with plain GroupedDataFrame
.
Working with group keys
As you can see in this example each group in a GroupedDataFrame
is associated with a GroupKey
. To get all
keys use the keys
function:
julia> keys(groupby(df, :int, sort=false))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
GroupKey: (int = 1,)
GroupKey: (int = 3,)
GroupKey: (int = 2,)
Let us, as an example extract the last key so see how one can work with it:
julia> key = last(keys(groupby(df, :int, sort=false)))
GroupKey: (int = 2,)
You can get a value of the key by property access or indexing:
julia> key.int
2
julia> key[1]
2
julia> key["int"]
2
julia> key[:int]
2
It is also easy co convert GroupKey
to a dictionary, vector, Tuple
or NamedTuple
if you would need it:
julia> Dict(key)
Dict{Symbol, Int64} with 1 entry:
:int => 2
julia> collect(key)
1-element Vector{Int64}:
2
julia> Tuple(key)
(2,)
julia> NamedTuple(key)
(int = 2,)
Note that, in general, you can group a data frame by multiple columns so you could query value of any grouping column
in the examples above. If you needed to get a list of grouping columns use the groupcols
function:
julia> groupcols(groupby(df, :int, sort=false))
1-element Vector{Symbol}:
:int
Conclusions
In this post we have learned how one can create a grouped data frame and how to choose the order of groups in it.
As a follow-up we have shown how GroupedDataFrame
interacts with functions like select
or combine
.
Next we discussed iterator interface support by GroupedDataFrame
and how to get and use information about
values of grouping columns for each group. I hope you found these examples useful.
In the post next week we will discuss how GroupedDataFrame
supports the indexing interface.