Re-posted from: https://bkamins.github.io/julialang/2023/05/19/pivot.html
Introduction
Creation of pivot tables is a common operation in exploratory data analysis.
Today I want to show you one example how this can be done in DataFrames.jl
that was prompted by a recent discussion on Julia Slack.
The post was written under Julia 1.9.0, Chain 0.5.0, DataFrames.jl 1.5.0, and RDatasets 0.7.7.
The data
We will work with the classical diamonds
dataset. Let us load it first:
julia> using DataFrames
julia> using Chain
julia> using RDatasets
julia> diamonds = RDatasets.dataset("ggplot2", "diamonds")
53940×10 DataFrame
Row │ Carat Cut Color Clarity Depth Table Price X Y Z
│ Float64 Cat… Cat… Cat… Float64 Float64 Int32 Float64 Float64 Float64
───────┼────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
2 │ 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
3 │ 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
4 │ 0.29 Premium I VS2 62.4 58.0 334 4.2 4.23 2.63
5 │ 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
6 │ 0.24 Very Good J VVS2 62.8 57.0 336 3.94 3.96 2.48
7 │ 0.24 Very Good I VVS1 62.3 57.0 336 3.95 3.98 2.47
8 │ 0.26 Very Good H SI1 61.9 55.0 337 4.07 4.11 2.53
9 │ 0.22 Fair E VS2 65.1 61.0 337 3.87 3.78 2.49
10 │ 0.23 Very Good H VS1 59.4 61.0 338 4.0 4.05 2.39
11 │ 0.3 Good J SI1 64.0 55.0 339 4.25 4.28 2.73
12 │ 0.23 Ideal J VS1 62.8 56.0 340 3.93 3.9 2.46
13 │ 0.22 Premium F SI1 60.4 61.0 342 3.88 3.84 2.33
14 │ 0.31 Ideal J SI2 62.2 54.0 344 4.35 4.37 2.71
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
53928 │ 0.79 Good F SI1 58.1 59.0 2756 6.06 6.13 3.54
53929 │ 0.79 Premium E SI2 61.4 58.0 2756 6.03 5.96 3.68
53930 │ 0.71 Ideal G VS1 61.4 56.0 2756 5.76 5.73 3.53
53931 │ 0.71 Premium E SI1 60.5 55.0 2756 5.79 5.74 3.49
53932 │ 0.71 Premium F SI1 59.8 62.0 2756 5.74 5.73 3.43
53933 │ 0.7 Very Good E VS2 60.5 59.0 2757 5.71 5.76 3.47
53934 │ 0.7 Very Good E VS2 61.2 59.0 2757 5.69 5.72 3.49
53935 │ 0.72 Premium D SI1 62.7 59.0 2757 5.69 5.73 3.58
53936 │ 0.72 Ideal D SI1 60.8 57.0 2757 5.75 5.76 3.5
53937 │ 0.72 Good D SI1 63.1 55.0 2757 5.69 5.75 3.61
53938 │ 0.7 Very Good D SI1 62.8 60.0 2757 5.66 5.68 3.56
53939 │ 0.86 Premium H SI2 61.0 58.0 2757 6.15 6.12 3.74
53940 │ 0.75 Ideal D SI2 62.2 55.0 2757 5.83 5.87 3.64
53913 rows omitted
The task we want to do is to analyze the distribution of :Cut
column by :Color
.
Note that these columns are Categorical
(as indicated by the Cat…
information above).
This allows us to check levels of :Cut
and :Color
to verify their ordering.
julia> levels(diamonds.Cut)
5-element Vector{String}:
"Fair"
"Good"
"Very Good"
"Premium"
"Ideal"
julia> levels(diamonds.Color)
7-element Vector{String}:
"D"
"E"
"F"
"G"
"H"
"I"
"J"
Now we are ready to start analyzing the data.
Simple pivot table
A simple pivot table would be to calculate number of observations of each
:Cut
, and :Color
combination. You can do it as follows:
julia> unstack(diamonds, :Cut, :Color, :Cut, combine=length)
5×8 DataFrame
Row │ Cut E I J H F G D
│ Cat… Int64? Int64? Int64? Int64? Int64? Int64? Int64?
─────┼───────────────────────────────────────────────────────────────────
1 │ Ideal 3903 2093 896 3115 3826 4884 2834
2 │ Premium 2337 1428 808 2360 2331 2924 1603
3 │ Good 933 522 307 702 909 871 662
4 │ Very Good 2400 1204 678 1824 2164 2299 1513
5 │ Fair 224 175 119 303 312 314 163
We see that we put the second positional argument of unstack
(:Cut
in our case) as rows,
and the third (:Color
) as columns. The fourth positional argument is what we put in the cells
of the pivot table. Since we want to get the number of observations (combine=length
) then
it does not matter which column we pass so I used :Cut
.
Fixing the order
The table looks nice, but there is one problem with it. The rows and columns are not ordered nicely.
The reason is that currently unstack
in DataFrames.jl orders them by the order of their appearance
in the source data frame.
We can fix it by sorting. The order of columns is set by pre-sorting the source data frame,
and the order of rows is set by post-sorting of the data frame returned by unstack
.
Note that I start using @chain
macro from Chain.jl for clarity of the code:
julia> @chain diamonds begin
sort(:Color)
unstack(:Cut, :Color, :Cut, combine=length)
sort!(:Cut)
end
5×8 DataFrame
Row │ Cut D E F G H I J
│ Cat… Int64? Int64? Int64? Int64? Int64? Int64? Int64?
─────┼───────────────────────────────────────────────────────────────────
1 │ Fair 163 224 312 314 303 175 119
2 │ Good 662 933 909 871 702 522 307
3 │ Very Good 1513 2400 2164 2299 1824 1204 678
4 │ Premium 1603 2337 2331 2924 2360 1428 808
5 │ Ideal 2834 3903 3826 4884 3115 2093 896
Now the table is nicely ordered. Notice that both sort
and sort!
functions are aware of
categorical nature of data and properly sort it.
We almost have what we wanted. The problem is that seeing counts does not allow us to easily
assess the distributions by diamond color. This can be easily added by transforming the columns
of our data frame.
Getting proportions
Let us turn the data from counts to proportions. We can do it using the transform!
function:
julia> @chain diamonds begin
sort(:Color)
unstack(:Cut, :Color, :Cut, combine=length)
sort!(:Cut)
transform(Not(:Cut) .=> x -> x / sum(x), renamecols=false)
end
5×8 DataFrame
Row │ Cut D E F G H I J
│ Cat… Float64 Float64 Float64 Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────
1 │ Fair 0.024059 0.0228641 0.0326975 0.0278073 0.0364884 0.0322759 0.0423789
2 │ Good 0.0977122 0.0952332 0.095263 0.0771343 0.0845376 0.0962744 0.10933
3 │ Very Good 0.223321 0.244973 0.226787 0.203595 0.219653 0.222058 0.241453
4 │ Premium 0.236605 0.238542 0.244288 0.258944 0.2842 0.263371 0.287749
5 │ Ideal 0.418303 0.398387 0.400964 0.432519 0.37512 0.38602 0.319088
Indeed J
diamonds seem to have worst :Cut
, and the best are G
diamonds.
Digging deeper into the data
To formally assess the order of columns by :Cut
quality let us turn the data from distribution
to a cumulative distribution first:
julia> @chain diamonds begin
sort(:Color)
unstack(:Cut, :Color, :Cut, combine=length)
sort!(:Cut)
transform!(Not(:Cut) .=> x -> cumsum(x / sum(x)), renamecols=false)
end
5×8 DataFrame
Row │ Cut D E F G H I J
│ Cat… Float64 Float64 Float64 Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────────────────────────────────────────────────
1 │ Fair 0.024059 0.0228641 0.0326975 0.0278073 0.0364884 0.0322759 0.0423789
2 │ Good 0.121771 0.118097 0.127961 0.104942 0.121026 0.12855 0.151709
3 │ Very Good 0.345092 0.36307 0.354747 0.308537 0.340679 0.350609 0.393162
4 │ Premium 0.581697 0.601613 0.599036 0.567481 0.62488 0.61398 0.680912
5 │ Ideal 1.0 1.0 1.0 1.0 1.0 1.0 1.0
We would like to order columns by the first-order stochastic dominance relation.
Since DataFrames.jl makes it easier to sort rows, let us permute the dimensions of our data frame first:
julia> @chain diamonds begin
sort(:Color)
unstack(:Cut, :Color, :Cut, combine=length)
sort!(:Cut)
transform!(Not(:Cut) .=> x -> cumsum(x / sum(x)), renamecols=false)
permutedims(:Cut, :Color)
end
7×6 DataFrame
Row │ Color Fair Good Very Good Premium Ideal
│ String Float64 Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────────────────────
1 │ D 0.024059 0.121771 0.345092 0.581697 1.0
2 │ E 0.0228641 0.118097 0.36307 0.601613 1.0
3 │ F 0.0326975 0.127961 0.354747 0.599036 1.0
4 │ G 0.0278073 0.104942 0.308537 0.567481 1.0
5 │ H 0.0364884 0.121026 0.340679 0.62488 1.0
6 │ I 0.0322759 0.12855 0.350609 0.61398 1.0
7 │ J 0.0423789 0.151709 0.393162 0.680912 1.0
Now we are ready to sort our data frame by all columns except :Color
:
julia> @chain diamonds begin
sort(:Color)
unstack(:Cut, :Color, :Cut, combine=length)
sort!(:Cut)
transform!(Not(:Cut) .=> x -> cumsum(x / sum(x)), renamecols=false)
permutedims(:Cut, :Color)
sort!(Not(:Color))
end
7×6 DataFrame
Row │ Color Fair Good Very Good Premium Ideal
│ String Float64 Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────────────────────
1 │ E 0.0228641 0.118097 0.36307 0.601613 1.0
2 │ D 0.024059 0.121771 0.345092 0.581697 1.0
3 │ G 0.0278073 0.104942 0.308537 0.567481 1.0
4 │ I 0.0322759 0.12855 0.350609 0.61398 1.0
5 │ F 0.0326975 0.127961 0.354747 0.599036 1.0
6 │ H 0.0364884 0.121026 0.340679 0.62488 1.0
7 │ J 0.0423789 0.151709 0.393162 0.680912 1.0
The sorting exercise did not work this time.
First-order stochastic dominance does not render the best and the worst option.
Note that e.g. options E
and G
do not dominate each other.
However we see that option J
is clearly worst as it has maximum values in all levels.
To see this more clearly let us do min-max scaling of all columns except :Ideal
(as it is constant):
julia> scale(x) = (x .- minimum(x)) / (maximum(x) - minimum(x))
scale (generic function with 1 method)
julia> @chain diamonds begin
sort(:Color)
unstack(:Cut, :Color, :Cut, combine=length)
sort!(:Cut)
transform!(Not(:Cut) .=> x -> cumsum(x / sum(x)), renamecols=false)
permutedims(:Cut, :Color)
sort!(Not(:Color))
transform!(Not([:Color, :Ideal]) .=> scale .=> identity)
end
7×6 DataFrame
Row │ Color Fair Good Very Good Premium Ideal
│ String Float64 Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────────────────────
1 │ E 0.0 0.281301 0.644408 0.300901 1.0
2 │ D 0.0612305 0.359855 0.431965 0.125328 1.0
3 │ G 0.253303 0.0 0.0 0.0 1.0
4 │ I 0.482289 0.504808 0.497151 0.409932 1.0
5 │ F 0.503895 0.492198 0.546059 0.278184 1.0
6 │ H 0.698153 0.343921 0.379817 0.506022 1.0
7 │ J 1.0 1.0 1.0 1.0 1.0
Indeed J
color has 1.0
values in all columns.
We also now can more clearly see that if we ignore the :Fair
level
the G
color dominates all other colors.
Note that in the last step I have shown an alternative to renamecols=false
of how one can keep the column names unchanged under transformation.
What you can do is pass the identity
function as target column name.
The reason is that if you pass a function as target column name then this function is applied
to source column name (and identity
keeps things as they were).
Conclusions
I hope you found this post useful for exploring some of the functionalities
of DataFrames.jl.
I also tried to show how nicely @chain
can be used to gradually build
an analysis. This is especially convenient in Julia REPL, since when you
go back in command history it allows you to get whole last command
(and not just a single line like in some other REPLs)
in the prompt with one up arrow key press and edit it.