Re-posted from: https://bkamins.github.io/julialang/2022/07/22/juliacon2022.html
Introduction
During JuliaCon 2022 I will run a
tutorial on DataFrames.jl.
In the tutorial I will focus on ways you can write transformation operations
using the select
/transform
/combine
functions and the operation
specification syntax.
In this post I want to give you a preview of the topics I am going to cover
in my tutorial.
The post was written under Julia 1.7.2, DataFrames.jl 1.3.4,
and DataFramesMeta.jl 0.12.
What is operation specification syntax?
If you are new to DataFrames.jl then you probably wonder what
operation specification syntax is. Fortunately it is quite easy.
If you want to transform data using one of the select
/transform
/combine
functions you can specify the transformation you want to perform using the
following general syntax:
[source columns] => [function] => [output columns]
For example if you write :a => mean => :b
you will get a mean of column :a
and store it in output data frame in column :b
(visually: input column :a
is
passed to the mean
function whose output is passed to output column :b
).
Additionally, if you prefer assignment style of specifying operations you can
use DataFramesMeta.jl package that allows you to write the same as just
:b = mean(:a)
. To use DataFramesMeta.jl you need to prefix the appropriate
function name with @
(to turn it into a macro).
Let me show you a minimal working example of such a transformation.
We compute mean of column :val
by groups defined by the :id
column:
julia> using DataFramesMeta
julia> using Statistics
julia> df = DataFrame(id=[1, 2, 1, 2, 1, 2], val=1:6)
6×2 DataFrame
Row │ id val
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 1 3
4 │ 2 4
5 │ 1 5
6 │ 2 6
julia> combine(groupby(df, :id), :val => mean => :mean_val)
2×2 DataFrame
Row │ id mean_val
│ Int64 Float64
─────┼─────────────────
1 │ 1 3.0
2 │ 2 4.0
julia> @combine(groupby(df, :id), :mean_val = mean(:val))
2×2 DataFrame
Row │ id mean_val
│ Int64 Float64
─────┼─────────────────
1 │ 1 3.0
2 │ 2 4.0
This is a simple example of what operation specification syntax
can do. In this post let me give you a more complex example (I explain
all the details of how it works in my upcoming tutorial).
The question from StackOverflow
In this StackOverflow question the user wanted to analyze iris
data
set and get 25% and 75% quantiles of the Sepal.Length
column.
The R code that the StackOverflow user provided was:
> library(dplyr)
> iris %>%
+ group_by(Species) %>%
+ summarise(
+ quantile(Sepal.Length, c(.25, .75)) %>%
+ matrix(nrow = 1) %>%
+ as.data.frame() %>%
+ setNames(paste0("Sepal.Length", c(.25, .75)))
+ )
# A tibble: 3 x 3
Species Sepal.Length0.25 Sepal.Length0.75
<fct> <dbl> <dbl>
1 setosa 4.8 5.2
2 versicolor 5.6 6.3
3 virginica 6.22 6.9
The question was how to achieve the same with DataFrames.jl and
DataFramesMeta.jl?
Here is a solution. First we need to load the iris
dataset (in the code I take
advantage of the fact that this dataset is bundled into DataFrames.jl
installation folders):
julia> using CSV
julia> iris = CSV.read(joinpath(dirname(pathof(DataFrames)),
"..", "docs", "src", "assets", "iris.csv"),
DataFrame)
150×5 DataFrame
Row │ SepalLength SepalWidth PetalLength PetalWidth Species
│ Float64 Float64 Float64 Float64 String15
─────┼──────────────────────────────────────────────────────────────────
1 │ 5.1 3.5 1.4 0.2 Iris-setosa
2 │ 4.9 3.0 1.4 0.2 Iris-setosa
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
149 │ 6.2 3.4 5.4 2.3 Iris-virginica
150 │ 5.9 3.0 5.1 1.8 Iris-virginica
146 rows omitted
Now we are ready to use the combine
function and the operation specification
syntax:
julia> combine(groupby(iris, :Species),
:SepalLength =>
(x -> [quantile(x, [0.25, 0.75])]) =>
string.("SepalLength", [0.25, 0.75]))
3×3 DataFrame
Row │ Species SepalLength0.25 SepalLength0.75
│ String15 Float64 Float64
─────┼───────────────────────────────────────────────────
1 │ Iris-setosa 4.8 5.2
2 │ Iris-versicolor 5.6 6.3
3 │ Iris-virginica 6.225 6.9
With DataFramesMeta.jl you would write this using the @combine
macro as
follows (I additionally show here how to use operation chaining with @chain
):
julia> @chain iris begin
groupby(:Species)
@combine($["SepalLength0.25", "SepalLength0.75"] = [quantile(:SepalLength, [0.25, 0.75])])
end
3×3 DataFrame
Row │ Species SepalLength0.25 SepalLength0.75
│ String15 Float64 Float64
─────┼───────────────────────────────────────────────────
1 │ Iris-setosa 4.8 5.2
2 │ Iris-versicolor 5.6 6.3
3 │ Iris-virginica 6.225 6.9
Conclusions
The operation specification syntax was designed to allow doing simple
transformations in an easy way, but at the same to also support quite complex
operations, like the one we did on the iris
data frame.
If you would like to hear a detailed explanation of how to write such code
please join me during the upcoming workshop.
You can find all the examples that I will use during the workshop in the
accompanying GitHub repository.