Re-posted from: https://bkamins.github.io/julialang/2024/03/08/gdf.html
Introduction
This is a follow-up to last week's post. We will continue discussing how to work with GroupedDataFrame objects in DataFrames.jl.
Today we focus on indexing grouped data frames.
The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.
Warm-up: getting group indices
First, create a grouped data frame:
julia> using DataFrames
julia> df = DataFrame(int=[1, 3, 2, 1, 3, 2],
str=["a", "a", "c", "c", "b", "b"])
6×2 DataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
3 │ 2 c
4 │ 1 c
5 │ 3 b
6 │ 2 b
julia> gdf = groupby(df, :str, sort=true)
GroupedDataFrame with 3 groups based on key: str
First Group (2 rows): str = "a"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
⋮
Last Group (2 rows): str = "c"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 1 c
It is sometimes useful to know the group number of each row of the source data frame df in a grouped data frame gdf.
You can easily get this information with groupindices:
julia> groupindices(gdf)
6-element Vector{Union{Missing, Int64}}:
1
1
3
3
2
2
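As a small illustration of how this vector can be used, the sketch below stores the group numbers next to the data and looks up the rows that belong to one group. The names df_with_group and rows_in_group_2 are introduced here only for the example.

```julia
using DataFrames

df = DataFrame(int=[1, 3, 2, 1, 3, 2],
               str=["a", "a", "c", "c", "b", "b"])
gdf = groupby(df, :str, sort=true)

# Group number of every row of df, in the row order of df
idx = groupindices(gdf)

# Keep the group numbers next to the data (on a copy, so df is untouched)
df_with_group = copy(df)
df_with_group.group = idx

# Rows of df that belong to group 2; isequal is used because the
# element type of idx allows missing
rows_in_group_2 = findall(isequal(2), idx)

# Selecting these rows from df gives the same data as gdf[2]
df[rows_in_group_2, :] == gdf[2]
```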
Extracting a single group
A basic operation when indexing a GroupedDataFrame is to pick a group by its number. Here is an example:
julia> gdf[1]
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
julia> gdf[2]
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 b
julia> gdf[3]
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 1 c
Note that gdf behaves similarly to a vector. You can even use begin and end in indexing:
julia> gdf[begin]
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
julia> gdf[end]
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 1 c
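The vector-like behavior also covers length and iteration. The following is a minimal sketch of that; the variable names n_groups and sums are chosen only for this example.

```julia
using DataFrames

df = DataFrame(int=[1, 3, 2, 1, 3, 2],
               str=["a", "a", "c", "c", "b", "b"])
gdf = groupby(df, :str, sort=true)

# Number of groups
n_groups = length(gdf)

# Iterating over gdf yields one SubDataFrame per group, in group order
sums = [sum(sdf.int) for sdf in gdf]

# begin and end correspond to the first and last valid group number
firstindex(gdf), lastindex(gdf)
```

With the data above, sums holds the totals of the int column for groups "a", "b", and "c" in that order.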
Often you might want to extract a group not by its position in gdf, but by the value of the grouping variable or variables. In this case you can use a GroupKey, dictionary, tuple, or named tuple.
Let us check how it works. Start with a dictionary, tuple, and named tuple:
julia> gdf[Dict("str" => "b")] # dictionary
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 b
julia> gdf[("b",)] # tuple
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 b
julia> gdf[(; str="b")] # named tuple
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 b
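The same selectors extend to grouping by several variables. The sketch below assumes a grouped data frame gdf3 (a name used only here) built on both columns of df. In a tuple or named tuple the values follow the order of the grouping columns, while a dictionary identifies them by name.

```julia
using DataFrames

df = DataFrame(int=[1, 3, 2, 1, 3, 2],
               str=["a", "a", "c", "c", "b", "b"])

# Group by two columns; every (str, int) combination forms its own group
gdf3 = groupby(df, [:str, :int])

# Tuple: values listed in the order of the grouping columns
g_tuple = gdf3[("a", 3)]

# Named tuple: names and order match the grouping columns
g_named = gdf3[(str="a", int=3)]

# Dictionary: values are looked up by column name
g_dict = gdf3[Dict("str" => "a", "int" => 3)]
```

All three lookups return the same single-row group.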
With GroupKey we first need to get it from keys, but everything else works the same:
julia> key = keys(gdf)[1]
GroupKey: (str = "a",)
julia> gdf[key]
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
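A GroupKey can also be inspected on its own, which is handy when you want to label results by group. Here is a brief sketch; labels and sizes are names made up for this example.

```julia
using DataFrames

df = DataFrame(int=[1, 3, 2, 1, 3, 2],
               str=["a", "a", "c", "c", "b", "b"])
gdf = groupby(df, :str, sort=true)

key = keys(gdf)[1]

# The value of a grouping variable can be read by name or by position
key.str
key[1]

# A key can be converted to a named tuple or a tuple
NamedTuple(key)
Tuple(key)

# Collect the grouping values of all groups, in group order
labels = [k.str for k in keys(gdf)]

# pairs gives each key together with its group
sizes = Dict(k.str => nrow(sdf) for (k, sdf) in pairs(gdf))
```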
You might ask why we require passing the grouping variable value in a container (dictionary, tuple, named tuple, GroupKey) instead of passing the required value directly when indexing. The reason is that if you grouped your data by an integer column, a bare integer could mean either a group number or a value of the grouping variable. Here is an example showing that under the defined rules there is no such ambiguity:
julia> gdf2 = groupby(df, :int, sort=false)
GroupedDataFrame with 3 groups based on key: int
First Group (2 rows): int = 1
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
⋮
Last Group (2 rows): int = 2
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
julia> gdf2[3] # third group
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
julia> gdf2[(3, )] # group with value of the grouping variable equal to 3
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 a
2 │ 3 b
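The difference between the two lookups can also be checked programmatically. This is a minimal sketch; by_position, by_value, and group_order are names introduced only for illustration.

```julia
using DataFrames

df = DataFrame(int=[1, 3, 2, 1, 3, 2],
               str=["a", "a", "c", "c", "b", "b"])
gdf2 = groupby(df, :int, sort=false)

# A bare integer is always a group number
by_position = gdf2[3]

# A tuple is always a value of the grouping variable
by_value = gdf2[(3,)]

# Values of the grouping variable in group order
group_order = [k.int for k in keys(gdf2)]

# The two lookups pick different groups
by_position.int, by_value.int
```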
Extracting multiple groups
You now know how to pick a single group, so selecting multiple groups is a natural next step.
You can use a collection of any of the selectors we have already discussed. Here are some examples:
julia> gdf[[3, 1]] # selection by group number
GroupedDataFrame with 2 groups based on key: str
First Group (2 rows): str = "c"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 1 c
⋮
Last Group (2 rows): str = "a"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
julia> gdf[[("c",), ("a",)]] # selection by grouping variable value
GroupedDataFrame with 2 groups based on key: str
First Group (2 rows): str = "c"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 1 c
⋮
Last Group (2 rows): str = "a"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
Note that indexing allows both reordering and dropping groups, which often comes in handy when analyzing data.
Also note that groupindices is aware of such changes:
julia> groupindices(gdf[[3, 1]])
6-element Vector{Union{Missing, Int64}}:
2
2
1
1
missing
missing
Here the group with "c" is first, the group with "a" is second, and the group with "b" is dropped, so missing is returned for its rows in the produced vector.
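One possible use of these missing values is to find which rows of the source data frame are still covered after groups were dropped. The following is a sketch of that idea, with sub and kept being names chosen for this example.

```julia
using DataFrames

df = DataFrame(int=[1, 3, 2, 1, 3, 2],
               str=["a", "a", "c", "c", "b", "b"])
gdf = groupby(df, :str, sort=true)

# Keep groups "c" and "a", in this order; group "b" is dropped
sub = gdf[[("c",), ("a",)]]

# Rows of df whose group was dropped get missing
idx = groupindices(sub)

# Boolean mask of the rows of df that still belong to some group
kept = .!ismissing.(idx)

# Rows of df covered by sub, in the original row order of df
df[kept, :]
```

Note that the mask follows the row order of df, not the group order of sub.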
It is also worth remembering that subset and filter can be used with GroupedDataFrame objects. This topic is discussed in a separate post.
Key lookup
Sometimes we do not want to index into a grouped data frame, but just check if it contains some key. This is easily achievable with the haskey function:
julia> haskey(gdf, ("a",))
true
julia> haskey(gdf, ("z",))
false
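If you want to check for a key and fetch the group in one step, the get function with a default value is an option. Here is a short sketch, assuming nothing is used as the fallback; found, absent, and n_rows are names used only here.

```julia
using DataFrames

df = DataFrame(int=[1, 3, 2, 1, 3, 2],
               str=["a", "a", "c", "c", "b", "b"])
gdf = groupby(df, :str, sort=true)

# The key exists, so the matching group is returned
found = get(gdf, ("a",), nothing)

# The key does not exist, so the default value is returned instead of an error
absent = get(gdf, ("z",), nothing)

# Handle the fallback explicitly
n_rows = isnothing(absent) ? 0 : nrow(absent)
```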
Conclusions
In this post we discussed indexing of GroupedDataFrame objects. This concludes the basic tutorial on working with these data structures.
I hope you will find the functionalities I have covered useful in your work.