Pivot tables in DataFrames.jl 1.4

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/11/18/unstack.html

Introduction

Some time ago I have written a post about the unstack function
in DataFrames.jl. Since DataFrames.jl 1.4 release this function has received
a major update of offered functionality. Currently it is possible to
easily create pivot tables using unstack as I am going to show you today.

The example we will use is simulation analysis of graphs, as in May 2023
I am co-organizing
18th Workshop on Algorithms and Models for the Web Graph.
If you are interested in this topic please come and join us. If you would
like to present your work, here is the Call For Papers.

The post was written under Julia 1.8.2, Graphs.jl 1.7.4,
ProgressMeter.jl 1.7.2, Plots.jl 1.36.2, and DataFrames.jl 1.4.3.

Generating the data

We will consider Erdős–Rényi random graphs. These graphs take two
parameters n (the number of nodes in the graph) and p edge probability
(independently for each edge). It can be calculated thus, that the expected
number of edges in this graph is pn(n-1)/2, so expected average node degree
is d=p(n-1). So, given n and d we can calculate p=d/(n-1). In Graphs.jl
you can generate these graphs using the erdos_renyi function.

Connected components of a graph are equivalence classes of a
reachability relation between pairs of nodes. Informally: in a connected
component all nodes can be reached from each other and no nodes outside
from a connected component can be reached from it.

Let me explain it by example:

julia> using Graphs

julia> using Random

julia> Random.seed!(1234);

julia> er = erdos_renyi(7, 0.2)
{7, 4} undirected simple Int64 graph

julia> collect(edges(er))
4-element Vector{Graphs.SimpleGraphs.SimpleEdge{Int64}}:
 Edge 1 => 6
 Edge 3 => 7
 Edge 4 => 6
 Edge 5 => 6

julia> connected_components(er)
3-element Vector{Vector{Int64}}:
 [1, 4, 5, 6]
 [2]
 [3, 7]

We created a random graph on 7 nodes with edge probability equal to 0.2.
It has 4 edges. Note, that our graph is undirected, so although Graphs.jl
displays edges like this: 1 => 6, the edges are bi-directional.
We note that node 6 is connected to nodes 1, 4, and 5. So they
form one connected component. There is also an edge between nodes 3 and 7,
so this is a second connected component. Finally node 2 is isolated, giving
us a third component. We can easily find the connected components of a graph
using the connected_components function.

We will want to investigate how the size of the largest connected component
of the Erdős–Rényi random graph depends on n and d using simulation.

We generate the data for n having values [100, 1000, 10_000, 100_000]
and d having values 0.5:0.1:2.0. For each combination
of values we repeat the experiment 64 times.

First create an initial data frame with the setup of the experiment:

julia> using DataFrames

julia> df = allcombinations(DataFrame,
                            d=0.5:0.1:2.0,
                            n=[100, 1000, 10_000, 100_000],
                            rep=1:64)
4096×3 DataFrame
  Row │ d        n       rep   
      │ Float64  Int64   Int64 
──────┼────────────────────────
    1 │     0.5     100      1
    2 │     0.6     100      1
    3 │     0.7     100      1
    4 │     0.8     100      1
    5 │     0.9     100      1
  ⋮   │    ⋮       ⋮       ⋮
 4092 │     1.6  100000     64
 4093 │     1.7  100000     64
 4094 │     1.8  100000     64
 4095 │     1.9  100000     64
 4096 │     2.0  100000     64
              4086 rows omitted

Now we compute p for each experiment:

julia> df.p = df.d ./ (df.n .- 1);

julia> df
4096×4 DataFrame
  Row │ d        n       rep    p
      │ Float64  Int64   Int64  Float64     
──────┼─────────────────────────────────────
    1 │     0.5     100      1  0.00505051
    2 │     0.6     100      1  0.00606061
    3 │     0.7     100      1  0.00707071
    4 │     0.8     100      1  0.00808081
    5 │     0.9     100      1  0.00909091
  ⋮   │    ⋮       ⋮       ⋮         ⋮
 4092 │     1.6  100000     64  1.60002e-5
 4093 │     1.7  100000     64  1.70002e-5
 4094 │     1.8  100000     64  1.80002e-5
 4095 │     1.9  100000     64  1.90002e-5
 4096 │     2.0  100000     64  2.00002e-5
                           4086 rows omitted

Finally for each row we compute the fraction of nodes in the largest component
(the computation takes some time, so I use @showprogress to monitor it):

julia> using ProgressMeter

julia> df.largest = @showprogress map(eachrow(df)) do row
           er = erdos_renyi(row.n, row.p)
           cc = connected_components(er)
           return maximum(length, cc) / row.n
       end;
Progress: 100%|███████████████████████████████████████████████| Time: 0:00:39

julia> df
4096×5 DataFrame
  Row │ d        n       rep    p            largest 
      │ Float64  Int64   Int64  Float64      Float64 
──────┼──────────────────────────────────────────────
    1 │     0.5     100      1  0.00505051   0.04
    2 │     0.6     100      1  0.00606061   0.07
    3 │     0.7     100      1  0.00707071   0.22
    4 │     0.8     100      1  0.00808081   0.29
    5 │     0.9     100      1  0.00909091   0.35
  ⋮   │    ⋮       ⋮       ⋮         ⋮          ⋮
 4092 │     1.6  100000     64  1.60002e-5   0.64289
 4093 │     1.7  100000     64  1.70002e-5   0.69333
 4094 │     1.8  100000     64  1.80002e-5   0.73421
 4095 │     1.9  100000     64  1.90002e-5   0.77021
 4096 │     2.0  100000     64  2.00002e-5   0.79805
                                    4086 rows omitted

Now we are ready to analyze this data using unstack.

Analyzing the results

First let me recall the general structure of the unstack call
(please refer to its documentation for more details):

unstack([data_frame],
        [columns to use as row keys],
        [column storing column names],
        [column storing the values];
        combine=[function to apply to the values])

The structure is easy to remember. After a data frame, we sequentially pass
what we want in rows, what we want in columns, what values we want to unstack,
and finally, a combine keyword argument (introduced in DataFrames.jl 1.4)
specifies what operation we want to apply to these values.

By default combine is only, which means that we want a unique value in
row-column combination and otherwise we get an error. Let us check how
unstack works by default by using [:d, :rep] for rows, :n for columns,
and :largest for values:

julia> unstack(df, [:d, :rep], :n, :largest)
1024×6 DataFrame
  Row │ d        rep    100       1000      10000     100000   
      │ Float64  Int64  Float64?  Float64?  Float64?  Float64? 
──────┼────────────────────────────────────────────────────────
    1 │     0.5      1      0.04     0.009    0.0017   0.00021
    2 │     0.6      1      0.07     0.046    0.0024   0.00035
    3 │     0.7      1      0.22     0.011    0.0042   0.00048
    4 │     0.8      1      0.29     0.019    0.01     0.00139
    5 │     0.9      1      0.35     0.027    0.0109   0.00166
  ⋮   │    ⋮       ⋮       ⋮         ⋮         ⋮         ⋮
 1020 │     1.6     64      0.63     0.665    0.6419   0.64289
 1021 │     1.7     64      0.69     0.683    0.6941   0.69333
 1022 │     1.8     64      0.33     0.746    0.7389   0.73421
 1023 │     1.9     64      0.84     0.761    0.7696   0.77021
 1024 │     2.0     64      0.83     0.818    0.7991   0.79805
                                               1104 rows omitted

Everything worked, because for each cell we had a unique combination of row and
column keys. However, if we, for example, dropped :rep, we would get an
error:

julia> unstack(df, :d, :n, :largest)
ERROR: ArgumentError: Duplicate entries in unstack
at row 65 for key (0.5,) and variable 100.
Pass `combine` keyword argumen t to specify how they should be handled.

Let us start with passing combine=copy to just store all values of
:largest per row-column combination:

julia> unstack(df, :d, :n, :largest, combine=copy)
16×5 DataFrame
 Row │ d        100                                1000                              
     │ Float64  Array…?                            Array…?                          
─────┼───────────────────────────────────────────────────────────────────
   1 │     0.5  [0.04, 0.06, 0.05, 0.05, 0.08, 0…  [0.009, 0.015, 0.009,
   2 │     0.6  [0.07, 0.07, 0.05, 0.07, 0.04, 0…  [0.046, 0.013, 0.018,
   3 │     0.7  [0.22, 0.13, 0.05, 0.09, 0.18, 0…  [0.011, 0.019, 0.024,
   4 │     0.8  [0.29, 0.11, 0.08, 0.09, 0.1, 0.…  [0.019, 0.068, 0.018,
   5 │     0.9  [0.35, 0.19, 0.19, 0.19, 0.11, 0…  [0.027, 0.029, 0.057,
   6 │     1.0  [0.42, 0.12, 0.15, 0.22, 0.16, 0…  [0.11, 0.068, 0.042, 
   7 │     1.1  [0.26, 0.32, 0.23, 0.69, 0.36, 0…  [0.146, 0.085, 0.266,
   8 │     1.2  [0.45, 0.09, 0.35, 0.42, 0.25, 0…  [0.232, 0.175, 0.338,
   9 │     1.3  [0.49, 0.47, 0.12, 0.17, 0.67, 0…  [0.376, 0.504, 0.434,
  10 │     1.4  [0.36, 0.45, 0.49, 0.49, 0.47, 0…  [0.448, 0.527, 0.443,
  11 │     1.5  [0.72, 0.6, 0.65, 0.57, 0.69, 0.…  [0.577, 0.474, 0.581,
  12 │     1.6  [0.31, 0.77, 0.41, 0.42, 0.73, 0…  [0.639, 0.584, 0.644,
  13 │     1.7  [0.78, 0.62, 0.68, 0.7, 0.67, 0.…  [0.673, 0.651, 0.653,
  14 │     1.8  [0.86, 0.91, 0.7, 0.71, 0.76, 0.…  [0.721, 0.739, 0.748,
  15 │     1.9  [0.8, 0.76, 0.79, 0.83, 0.76, 0.…  [0.791, 0.794, 0.82, 
  16 │     2.0  [0.7, 0.74, 0.87, 0.88, 0.83, 0.…  [0.809, 0.776, 0.792,
                                                        3 columns omitted

Sometimes such operation is useful, but often we are interested in some
aggregates.

First let us check if indeed the data was generated correctly, i.e. if the
grid indeed has 64 experiments per combination of parameters. For this we use
the combine=length keyword argument (as we just want to check the number
of elements per dn combination:

julia> unstack(df, :d, :n, :largest, combine=length)
16×5 DataFrame
 Row │ d        100     1000    10000   100000 
     │ Float64  Int64?  Int64?  Int64?  Int64?
─────┼─────────────────────────────────────────
   1 │     0.5      64      64      64      64
   2 │     0.6      64      64      64      64
   3 │     0.7      64      64      64      64
   4 │     0.8      64      64      64      64
   5 │     0.9      64      64      64      64
   6 │     1.0      64      64      64      64
   7 │     1.1      64      64      64      64
   8 │     1.2      64      64      64      64
   9 │     1.3      64      64      64      64
  10 │     1.4      64      64      64      64
  11 │     1.5      64      64      64      64
  12 │     1.6      64      64      64      64
  13 │     1.7      64      64      64      64
  14 │     1.8      64      64      64      64
  15 │     1.9      64      64      64      64
  16 │     2.0      64      64      64      64

All looks good. So now we can check the mean and standard deviation of size of
the largest connected component for each combination:

julia> using Statistics

julia> unstack(df, :d, :n, :largest, combine=mean)
16×5 DataFrame
 Row │ d        100        1000       10000       100000      
     │ Float64  Float64?   Float64?   Float64?    Float64?
─────┼────────────────────────────────────────────────────────
   1 │     0.5  0.0525     0.0109375  0.00178125  0.000269219
   2 │     0.6  0.0776562  0.0159844  0.00266563  0.00040125
   3 │     0.7  0.0982813  0.0222969  0.00409688  0.000635625
   4 │     0.8  0.134375   0.031875   0.00668906  0.00121281
   5 │     0.9  0.161562   0.0485625  0.0136969   0.00312172
   6 │     1.0  0.205937   0.0905312  0.0445984   0.0187338
   7 │     1.1  0.265781   0.150547   0.149169    0.175325
   8 │     1.2  0.355625   0.272266   0.307183    0.313707
   9 │     1.3  0.375156   0.403047   0.421645    0.423478
  10 │     1.4  0.455313   0.503437   0.507891    0.510792
  11 │     1.5  0.594062   0.57925    0.580353    0.582628
  12 │     1.6  0.596719   0.63575    0.6428      0.641092
  13 │     1.7  0.697656   0.689906   0.691744    0.691409
  14 │     1.8  0.738281   0.730672   0.732986    0.732874
  15 │     1.9  0.76125    0.763125   0.768134    0.767024
  16 │     2.0  0.8025     0.797609   0.796216    0.796993

julia> agg_std = unstack(df, :d, :n, :largest, combine=std)
16×5 DataFrame
 Row │ d        100        1000        10000        100000      
     │ Float64  Float64?   Float64?    Float64?     Float64?
─────┼──────────────────────────────────────────────────────────
   1 │     0.5  0.0184305  0.00263598  0.000366829  4.57843e-5
   2 │     0.6  0.0422292  0.00599071  0.000623411  8.65108e-5
   3 │     0.7  0.0538238  0.00819539  0.00102446   0.000125444
   4 │     0.8  0.0635928  0.0138844   0.00234887   0.000429261
   5 │     0.9  0.0928426  0.0202656   0.00613093   0.00137704
   6 │     1.0  0.102581   0.0406045   0.026229     0.0111048
   7 │     1.1  0.141442   0.0748327   0.058437     0.0145289
   8 │     1.2  0.15247    0.0863053   0.0289068    0.00968446
   9 │     1.3  0.15323    0.0694691   0.0170002    0.00647732
  10 │     1.4  0.148399   0.0527765   0.0118846    0.00492798
  11 │     1.5  0.118987   0.0439715   0.0121344    0.00403464
  12 │     1.6  0.150407   0.0286661   0.00957977   0.00351468
  13 │     1.7  0.113833   0.0408923   0.00952214   0.00307108
  14 │     1.8  0.0969627  0.021692    0.00890106   0.00240749
  15 │     1.9  0.0829324  0.0224821   0.00788557   0.00210828
  16 │     2.0  0.0755929  0.0213895   0.0066356    0.00233577

What we can notice (I will not go into mathematics of these results; however,
if you find such computations interesting please be sure to join us during
WAW2023 conference):

  • as we increase average node degree the average size of the largest component
    increases;
  • there seems to be a sharp change for average degree above 1; it is especially
    visible as we increase number of nodes in the graph;
  • the results have largest standard deviation around average degree equal to 1;
    it drops for low and high values of average degree; also the standard
    deviation in general decreases as we increase number of nodes in the graph.

Let me additionally visualize the last observation:

julia> using Plots

julia> plot(agg_std.d, Matrix(agg_std[:, 2:end]),
            xlabel="average degree",
            ylabel="std of largest component relative size",
            labels=permutedims(names(agg_std, Not(:d))))

Here is a generated plot:

Standard deviation of largest component relative size

Conclusions

I hope you will find the new combine capabilities of unstack useful.

In DataFrames.jl 1.5 release we plan to add even more capabilities to unstack
(like allowing to use multiple columns to generate columns, or allowing to
process multiple value columns). I will post about it when it is done.