
Filtering grouped data frames in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/06/02/filter.html

Introduction

In DataFrames.jl a data frame can be grouped by columns.
Grouping creates a view of the data. The type of this view
is GroupedDataFrame, and it allows you to perform groupwise
operations on the data stored in the original data frame.

If you use SQL then you can think of a GroupedDataFrame
as an object that is obtained by using the GROUP BY clause
but without performing an aggregation step. This is often
useful, as after grouping data you might want to perform
several different aggregation operations on it without
having to group it each time.
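
Here is a minimal sketch of this reuse pattern (the tiny data frame is made up just for illustration):

julia> using DataFrames

julia> g = groupby(DataFrame(k=[1, 1, 2], v=1:3), :k);

julia> combine(g, :v => sum)
2×2 DataFrame
 Row │ k      v_sum
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      3

julia> combine(g, :v => maximum)  # no need to group again
2×2 DataFrame
 Row │ k      v_maximum
     │ Int64  Int64
─────┼──────────────────
   1 │     1          2
   2 │     2          3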

In this post I want to discuss how filtering of a
GroupedDataFrame works in DataFrames.jl.
In general, there are two ways you might want to filter it:

  • by dropping whole groups;
  • by dropping rows within groups.

I will discuss both today.

The post was written under Julia 1.9.0 and DataFrames.jl 1.5.0.

Preparing the data

Start by creating a sample GroupedDataFrame:

julia> using DataFrames

julia> df = DataFrame(x = repeat(1:4, 2), id=1:8)
8×2 DataFrame
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3
   4 │     4      4
   5 │     1      5
   6 │     2      6
   7 │     3      7
   8 │     4      8

julia> gdf = groupby(df, :x, sort=true);

julia> show(gdf, allgroups=true)
GroupedDataFrame with 4 groups based on key: x
Group 1 (2 rows): x = 1
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      5
Group 2 (2 rows): x = 2
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     2      2
   2 │     2      6
Group 3 (2 rows): x = 3
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     3      7
Group 4 (2 rows): x = 4
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     4      4
   2 │     4      8

Dropping whole groups

The first type of operation is when you want to drop whole groups
from your grouped data frame.

First, assume that you want to drop
the first and last group from gdf. You can achieve this using indexing:

julia> gdf[[2, 3]]
GroupedDataFrame with 2 groups based on key: x
First Group (2 rows): x = 2
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     2      2
   2 │     2      6
⋮
Last Group (2 rows): x = 3
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     3      7

julia> gdf[Not(begin, end)]
GroupedDataFrame with 2 groups based on key: x
First Group (2 rows): x = 2
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     2      2
   2 │     2      6
⋮
Last Group (2 rows): x = 3
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     3      7

Now assume that you want to drop groups in which the value of the grouping
variable :x is even. You could still use indexing like this:

julia> gdf[isodd.(getproperty.(keys(gdf), :x))]
GroupedDataFrame with 2 groups based on key: x
First Group (2 rows): x = 1
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      5
⋮
Last Group (2 rows): x = 3
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     3      7

However, a more general approach is to use the filter function:

julia> filter(sdf -> isodd(first(sdf.x)), gdf)
GroupedDataFrame with 2 groups based on key: x
First Group (2 rows): x = 1
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      5
⋮
Last Group (2 rows): x = 3
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     3      7

If you want to drop the grouping of the result, pass ungroup=true to filter:

julia> filter(sdf -> isodd(first(sdf.x)), gdf, ungroup=true)
4×2 DataFrame
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      5
   3 │     3      3
   4 │     3      7

Dropping rows within group

Now that you know how to drop whole groups, you might wonder how to drop individual rows
within groups. Assume that we want to keep, within each group, only
the row with the minimal value of :id. You can achieve this using the
subset function:

julia> subset(gdf, :id => x -> x .== minimum(x))
4×2 DataFrame
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3
   4 │     4      4

Note that by default subset ungroups the data. If you want to keep
it grouped, pass ungroup=false:

julia> show(subset(gdf, :id => x -> x .== minimum(x), ungroup=false),
            allgroups=true)
GroupedDataFrame with 4 groups based on key: x
Group 1 (1 row): x = 1
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
Group 2 (1 row): x = 2
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     2      2
Group 3 (1 row): x = 3
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
Group 4 (1 row): x = 4
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     4      4

Conclusions

I hope you found the examples using indexing, filter, and subset
useful and that they improved your understanding of the row filtering
options available in DataFrames.jl.

In particular, remember that filter and subset have different
default values of the ungroup keyword argument. This difference
reflects the fact that when using filter we typically do not
want to ungroup the data, while when using subset we typically
want to drop the grouping.
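
A quick sketch showing these defaults in action, reusing gdf from the examples above (the trivial predicates are just for illustration):

julia> filter(sdf -> true, gdf) isa GroupedDataFrame  # grouping is kept
true

julia> subset(gdf, :id => ByRow(x -> true)) isa DataFrame  # grouping is dropped
true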

Lessons learned from WAW2023 conference

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/05/26/waw2023.html

Introduction

Today is the last day of the WAW2023 conference.
It was a great experience to attend it. I had an opportunity
to meet some of the top experts in analyzing and modeling graphs
and hypergraphs. In particular, it was really nice to listen
to several talks on tools available in the Julia language
for processing such data.

Out of the many problems discussed during the conference I picked
one that I particularly like and would like to cover today.

The problem

The task was proposed by Lasse Leskelä from Aalto University.

The problem is formulated as follows. Assume we have two types
of hockey attackers: wingers and centers. There are 300 players
in total out of which 100 are wingers and 200 are centers.

The players go onto the ice in triplets. The probability that a given
triplet scores a goal in a single action depends on its composition
and is as follows:

  • winger-winger-winger: 0.24
  • winger-winger-center: 0.40
  • winger-center-center: 0.32
  • center-center-center: 0.36

As you can see, the best composition is one center and two wingers,
and the worst composition is three wingers.

Now assume that:

  • you do not know who is a center and who is a winger;
  • you do not know what is the fraction of centers and wingers in the data;
  • you do not know the scoring probabilities.

The question is: given the scoring performance of all triplets
that can be formed from these 300 players (each triplet is allowed
to play exactly one action), can you identify from this data that
there are indeed two groups of players, and can you correctly recover them?

Why is this problem interesting?

A natural question you might ask is what the expected scoring
probability of a center and of a winger is in our data. Let us calculate it.

If I am a center then my probability of scoring is:

1/9 * 0.40 + 4/9 * 0.32 + 4/9 * 0.36 = 26/75 = 34.(6)%

The probabilities 1/9, 4/9, and 4/9 correspond to the probability
that a player is matched with a winger-winger, winger-center, and center-center
combination, respectively (if you are not sure about them, we will soon
check them with data).

The same calculation for winger yields:

1/9 * 0.24 + 4/9 * 0.40 + 4/9 * 0.32 = 26/75 = 34.(6)%

As you can see, wingers and centers score with the same
probability on average in our data.

Therefore we cannot expect to be able to distinguish centers
and wingers by just looking at their average scoring.

Maybe then looking at pairs of players could help us?
Let us check the scoring probabilities of winger-winger,
winger-center, and center-center pairs.

For winger-winger we get:

2/3 * 0.40 + 1/3 * 0.24 = 26/75 = 34.(6)%

For winger-center:

2/3 * 0.32 + 1/3 * 0.40 = 26/75 = 34.(6)%

And for center-center:

2/3 * 0.36 + 1/3 * 0.32 = 26/75 = 34.(6)%
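
Here is a quick REPL check of this arithmetic (the variable names are mine, just for illustration):

julia> center = 1/9 * 0.40 + 4/9 * 0.32 + 4/9 * 0.36;

julia> winger = 1/9 * 0.24 + 4/9 * 0.40 + 4/9 * 0.32;

julia> center ≈ winger ≈ 26/75
true

julia> ww = 2/3 * 0.40 + 1/3 * 0.24;  # winger-winger pair

julia> wc = 2/3 * 0.32 + 1/3 * 0.40;  # winger-center pair

julia> cc = 2/3 * 0.36 + 1/3 * 0.32;  # center-center pair

julia> ww ≈ wc ≈ cc ≈ 26/75
true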

Unfortunately, looking at pairs of players still does not give
us enough insight to solve the problem. It turns out that
we need to take three-way interactions between players
into account to have a hope of solving it.

From a graph perspective we can look at this problem as follows.
Players are nodes, and each triplet that scored a goal forms a
three-element hyperedge. What we have learned above is that to solve
the problem we need to use some method that is hyperedge-aware.

Preparing the data

To set the stage, start by generating sample data for our problem.
In the code I encode wingers with 0 and centers with 1.
Therefore, if we take three players, the sum of their codes ranges
from 0 to 3 and encodes the composition of the attacking team.

First load the packages:

julia> using Random

julia> using Graphs

julia> using DataFrames

julia> using Plots

julia> using Chain

julia> using Statistics

julia> import MultivariateStats

Next define a function result that maps a player-type triplet to a Boolean
value that tells us if a given set of players scored (with the probabilities
described above):

julia> prob(t1, t2, t3) = (0.24, 0.40, 0.32, 0.36)[t1 + t2 + t3 + 1]
prob (generic function with 1 method)

julia> result(t1, t2, t3) = rand() < prob(t1, t2, t3)
result (generic function with 1 method)

Now we are ready to generate the players. Start by assigning
each player a random type (we first define the counts of wingers
and centers given in the problem statement):

julia> n_winger, n_center = 100, 200;  # 100 wingers, 200 centers

julia> ids = shuffle!([fill(0, n_winger); fill(1, n_center)])
300-element Vector{Int64}:
 1
 1
 0
 ⋮

 1
 0
 1

We are now ready to generate the table holding all player triplets.
Note that I want to ensure that each triplet is stored in it only
once (up to permutation), so I filter the table, keeping only rows
in which the player numbers are increasing:

julia> df = allcombinations(DataFrame, x1=1:300, x2=1:300, x3=1:300)
27000000×3 DataFrame
      Row │ x1     x2     x3
          │ Int64  Int64  Int64
──────────┼─────────────────────
        1 │     1      1      1
        2 │     2      1      1
        3 │     3      1      1
        4 │     4      1      1
    ⋮     │   ⋮      ⋮      ⋮
 26999997 │   297    300    300
 26999998 │   298    300    300
 26999999 │   299    300    300
 27000000 │   300    300    300
           26999992 rows omitted

julia> subset!(df, [[:x1, :x2], [:x2, :x3]] .=> ByRow(<))
4455100×3 DataFrame
     Row │ x1     x2     x3
         │ Int64  Int64  Int64
─────────┼─────────────────────
       1 │     1      2      3
       2 │     1      2      4
       3 │     1      3      4
       4 │     2      3      4
    ⋮    │   ⋮      ⋮      ⋮
 4455097 │   295    299    300
 4455098 │   296    299    300
 4455099 │   297    299    300
 4455100 │   298    299    300
           4455092 rows omitted

Now for each triplet we add the player type information
and then call the result function to randomly decide
if a given triplet scored or not:

julia> transform!(df, All() .=> (x -> ids[x]) .=> [:t1, :t2, :t3])
4455100×6 DataFrame
     Row │ x1     x2     x3     t1     t2     t3
         │ Int64  Int64  Int64  Int64  Int64  Int64
─────────┼──────────────────────────────────────────
       1 │     1      2      3      1      1      0
       2 │     1      2      4      1      1      1
       3 │     1      3      4      1      0      1
       4 │     2      3      4      1      0      1
    ⋮    │   ⋮      ⋮      ⋮      ⋮      ⋮      ⋮
 4455097 │   295    299    300      0      0      1
 4455098 │   296    299    300      0      0      1
 4455099 │   297    299    300      0      0      1
 4455100 │   298    299    300      1      0      1
                                4455092 rows omitted

julia> transform!(df, r"t" => ByRow(result) => :result)
4455100×7 DataFrame
     Row │ x1     x2     x3     t1     t2     t3     result
         │ Int64  Int64  Int64  Int64  Int64  Int64  Bool
─────────┼──────────────────────────────────────────────────
       1 │     1      2      3      1      1      0    true
       2 │     1      2      4      1      1      1   false
       3 │     1      3      4      1      0      1    true
       4 │     2      3      4      1      0      1    true
    ⋮    │   ⋮      ⋮      ⋮      ⋮      ⋮      ⋮      ⋮
 4455097 │   295    299    300      0      0      1    true
 4455098 │   296    299    300      0      0      1    true
 4455099 │   297    299    300      0      0      1    true
 4455100 │   298    299    300      1      0      1   false
                                        4455092 rows omitted

We have our data ready. Let us first check if indeed the
theoretical computations we did above hold.

Initial screening of data

Start by computing the mean scoring probability by player
and the number of times each player is present in
the data (and sort the data frame by mean result
to see if there are any large differences in it):

julia> @chain df begin
           [select(_, "x$i" => :x, :result) for i in 1:3]
           vcat(_...)
           groupby(:x)
           combine(:result => mean, nrow)
           sort!(:result_mean)
       end
300×3 DataFrame
 Row │ x      result_mean  nrow
     │ Int64  Float64      Int64
─────┼───────────────────────────
   1 │   106     0.341294  44551
   2 │   121     0.341429  44551
   3 │    10     0.341878  44551
   4 │    46     0.342529  44551
   5 │    71     0.342596  44551
   6 │     2     0.342663  44551
   7 │   281     0.342776  44551
   8 │    60     0.342865  44551
   9 │   233     0.342888  44551
  10 │    73     0.343045  44551
  11 │   266     0.34318   44551
  12 │   131     0.343202  44551
  13 │    40     0.343225  44551
  14 │   138     0.343292  44551
  ⋮  │   ⋮         ⋮         ⋮
 288 │   176     0.350205  44551
 289 │   293     0.350205  44551
 290 │    19     0.35034   44551
 291 │    48     0.350407  44551
 292 │   155     0.350632  44551
 293 │   195     0.350744  44551
 294 │   132     0.350879  44551
 295 │    96     0.351103  44551
 296 │   110     0.35144   44551
 297 │   172     0.351687  44551
 298 │   254     0.351687  44551
 299 │   231     0.352383  44551
 300 │   163     0.354964  44551
                 273 rows omitted

Indeed, we see that all values are close to 34.(6)% and
that each player appears in 299*298/2 = 44551 rows of our table,
as expected.

The second check is the mean scoring probability by player type
and the fraction of each player type:

julia> @chain df begin
           [select(_, "t$i" => :t, :result) for i in 1:3]
           vcat(_...)
           groupby(:t)
           combine(:result => mean, proprow)
       end
2×3 DataFrame
 Row │ t      result_mean  proprow
     │ Int64  Float64      Float64
─────┼──────────────────────────────
   1 │     0     0.346636  0.333333
   2 │     1     0.346613  0.666667

Again the results are consistent with what we should get.

Let us now check the scoring probability and distribution by
pairs of player types:

julia> @chain df begin
           [select(_, ["t$i", "t$(mod1(i+1, 3))"] =>
                      ByRow(minmax) =>
                      [:ta, :tb], :result) for i in 1:3]
           vcat(_...)
           groupby([:ta, :tb])
           combine(:result => mean, proprow)
       end
3×4 DataFrame
 Row │ ta     tb     result_mean  proprow
     │ Int64  Int64  Float64      Float64
─────┼─────────────────────────────────────
   1 │     0      0     0.346844  0.110368
   2 │     0      1     0.346532  0.445931
   3 │     1      1     0.346653  0.443701

All looks good again. The mean is constant and we get the 1:4:4 proportion
of the different pair compositions.

Finally we check the three way combinations:

julia> combine(groupby(df, r"t\d"), :result => mean, proprow)
8×5 DataFrame
 Row │ t1     t2     t3     result_mean  proprow
     │ Int64  Int64  Int64  Float64      Float64
─────┼─────────────────────────────────────────────
   1 │     0      0      0     0.238695  0.0362955
   2 │     0      0      1     0.398932  0.0821452
   3 │     0      1      0     0.400749  0.076282
   4 │     0      1      1     0.319947  0.169792
   5 │     1      0      0     0.399914  0.06379
   6 │     1      0      1     0.320193  0.14399
   7 │     1      1      0     0.319905  0.132896
   8 │     1      1      1     0.360107  0.294808

Now we indeed see the expected differences in means.
Also the computed probabilities of seeing a given combination
of values are close to expectations, e.g. for the 0-0-0 combination
we expected (1/3)^3 ≈ 3.7% and for 1-1-1 we expected (2/3)^3 ≈ 29.6%
of observations.
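
As a quick sanity check of these expectations we can use exact rationals (1/27 ≈ 3.7% and 8/27 ≈ 29.6%):

julia> (1//3)^3, (2//3)^3
(1//27, 8//27)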

Solving the puzzle

From the analysis above we know that we need to take three-way
relationships into account to distinguish centers from wingers.

Let me remind you that we assumed that:

  • you do not know who is a center and who is a winger;
  • you do not know what is the fraction of centers and wingers in the data;
  • you do not know the scoring probabilities.

Therefore we are in an unsupervised learning setting. What we are looking
for is some representation of this data under which wingers and centers
would form separate clusters. We cannot hope to say which players are centers
and which are wingers, but maybe we can visually separate them.

The idea is to represent our data as a matrix (designed in a way
that captures three-way relationships between players) and then
project it to a single dimension (and hope to see some nice pattern).

How should we define the matrix? Let us try the following. I call the matrix
inc2. It will have 300 rows and 300*299 columns.
Rows represent node numbers. Columns represent node pairs
(with this indexing scheme we can address up to 300*299 such pairs).
The player pair (a, b), where a < b, is assigned to column number j = 300*(a-1) + b.
I put 1 in the inc2[i, j] cell if the unordered triplet {i, a, b}
scored a goal. Otherwise I put 0 in the matrix.
Note that this matrix is sparse (some of its columns are all zeros), but my
data is small enough that I do not need to optimize the code to take this into account.
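
If you prefer to make the indexing scheme explicit, you could wrap it in a small helper (pair_col is a name I made up; it assumes a < b, which holds in our data):

julia> pair_col(a, b) = 300 * (a - 1) + b
pair_col (generic function with 1 method)

julia> pair_col(299, 300)  # the largest column index we can hit
89700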

Here is the code generating the matrix:

julia> inc2 = zeros(Bool, 300, 300*299)
300×89700 Matrix{Bool}:
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 ⋮              ⋮              ⋮              ⋮           ⋱              ⋮              ⋮              ⋮
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

julia> for row in eachrow(subset(df, :result))
           inc2[row.x1, 300*(row.x2-1) + row.x3] = true
           inc2[row.x2, 300*(row.x1-1) + row.x3] = true
           inc2[row.x3, 300*(row.x1-1) + row.x2] = true
       end

julia> inc2
300×89700 Matrix{Bool}:
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  1  0  0  0  0  0  1  0  0  0  0  1  0  1  1  0  1     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  1  0  1  0  0  1  0  1  0  0  1  0  1  1  1  0  0  1     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
 0  0  1  0  0  1  0  0  0  0  1  0  1  0  1  0  1  0  0     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  1  0  0  1  0  0  0  1  1  0  0  0  1     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
 0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  1  0  1  …  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 ⋮              ⋮              ⋮              ⋮           ⋱              ⋮              ⋮              ⋮
 0  0  1  0  0  1  1  0  0  1  0  0  0  1  1  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
 0  0  0  0  0  0  1  1  0  0  0  0  0  1  0  0  0  0  1  …  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
 0  0  0  0  0  0  0  0  1  0  1  0  1  0  1  0  0  0  1     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
 0  1  0  1  0  0  0  1  0  0  1  1  0  1  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  1  0  0  0  0  0  0  1  0  1  1  0  1     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  1  1  1  0  1  0  1  0  1  0  0  0  0  0  1  1  0  0     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

I called the matrix inc2 as it represents incidence between nodes and pairs of nodes.
In the for loop above I took advantage of the fact that I generated my data so that x1 < x2 < x3.

Let us quickly check if indeed each node has approximately the same number of 1 entries:

julia> sum(inc2, dims=2)
300×1 Matrix{Int64}:
 15340
 15266
 15362
 15402
 15523
 15324
     ⋮
 15324
 15464
 15367
 15543
 15470
 15495

Indeed, we see that each row sum is roughly 44551 * 34.(6)% ≈ 15444, as expected.

Now we have an 89700-dimensional representation of each node. We want to compress it to one dimension.
Let us pick classic PCA. In MultivariateStats.jl observations are stored in columns,
so I work with the transpose of the inc2 matrix:

julia> p = MultivariateStats.fit(MultivariateStats.PCA, inc2', maxoutdim=1);

julia> t = MultivariateStats.transform(p, inc2')
1×300 Matrix{Float64}:
 -7.38127  -7.47636  14.2659  -5.88158  13.5279  -6.48501  13.0815  …  13.9629  14.2848  13.4581  -7.5586  14.5669  -7.71527

Now for each observation (column in the t matrix) we have its 1-dimensional representation.

Let us visually check if this representation has some structure. Start with a histogram:

julia> histogram(vec(t); nbins=50, legend=false)

You can see the produced plot below:

[Plot: histogram of the one-dimensional PCA scores]

It looks promising. We got two separated clusters. But do they correspond to our ids?

julia> histogram(t[ids .== 1]; nbins=20, label="1")

julia> histogram!(t[ids .== 0]; nbins=10, label="0")

You can see the produced plot below:

[Plot: histograms of the PCA scores split by player type]

It seems we got a perfect separation. Let us double-check it:

julia> describe(t[ids .== 1])
Summary Stats:
Length:         200
Missing Count:  0
Mean:           -6.914677
Minimum:        -9.086461
1st Quartile:   -7.496597
Median:         -6.926006
3rd Quartile:   -6.359373
Maximum:        -4.606485
Type:           Float64

julia> describe(t[ids .== 0])
Summary Stats:
Length:         100
Missing Count:  0
Mean:           13.829354
Minimum:        12.317402
1st Quartile:   13.322935
Median:         13.911846
3rd Quartile:   14.269240
Maximum:        15.791385
Type:           Float64

The distributions confirm that we recovered the communities correctly.
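
Since the sign of a principal component is arbitrary, a run-independent way to confirm the separation is to check that the two score ranges do not overlap (in this run centers happened to get the negative scores):

julia> maximum(t[ids .== 1]) < minimum(t[ids .== 0])
true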

Conclusions

The presented approach is one of many that could be used to solve the puzzle.
I encourage you to think of alternative approaches.

However, I think that the puzzle nicely shows that sometimes slightly
reformulating a problem allows you to use standard data science tools to solve it.

I hope you enjoyed it!

Pivot tables in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/05/19/pivot.html

Introduction

Creation of pivot tables is a common operation in exploratory data analysis.
Today I want to show you one example of how this can be done in DataFrames.jl,
prompted by a recent discussion on Julia Slack.

The post was written under Julia 1.9.0, Chain.jl 0.5.0, DataFrames.jl 1.5.0, and RDatasets.jl 0.7.7.

The data

We will work with the classical diamonds dataset. Let us load it first:

julia> using DataFrames

julia> using Chain

julia> using RDatasets

julia> diamonds = RDatasets.dataset("ggplot2", "diamonds")
53940×10 DataFrame
   Row │ Carat    Cut        Color  Clarity  Depth    Table    Price  X        Y        Z
       │ Float64  Cat…       Cat…   Cat…     Float64  Float64  Int32  Float64  Float64  Float64
───────┼────────────────────────────────────────────────────────────────────────────────────────
     1 │    0.23  Ideal      E      SI2         61.5     55.0    326     3.95     3.98     2.43
     2 │    0.21  Premium    E      SI1         59.8     61.0    326     3.89     3.84     2.31
     3 │    0.23  Good       E      VS1         56.9     65.0    327     4.05     4.07     2.31
     4 │    0.29  Premium    I      VS2         62.4     58.0    334     4.2      4.23     2.63
     5 │    0.31  Good       J      SI2         63.3     58.0    335     4.34     4.35     2.75
     6 │    0.24  Very Good  J      VVS2        62.8     57.0    336     3.94     3.96     2.48
     7 │    0.24  Very Good  I      VVS1        62.3     57.0    336     3.95     3.98     2.47
     8 │    0.26  Very Good  H      SI1         61.9     55.0    337     4.07     4.11     2.53
     9 │    0.22  Fair       E      VS2         65.1     61.0    337     3.87     3.78     2.49
    10 │    0.23  Very Good  H      VS1         59.4     61.0    338     4.0      4.05     2.39
    11 │    0.3   Good       J      SI1         64.0     55.0    339     4.25     4.28     2.73
    12 │    0.23  Ideal      J      VS1         62.8     56.0    340     3.93     3.9      2.46
    13 │    0.22  Premium    F      SI1         60.4     61.0    342     3.88     3.84     2.33
    14 │    0.31  Ideal      J      SI2         62.2     54.0    344     4.35     4.37     2.71
   ⋮   │    ⋮         ⋮        ⋮       ⋮        ⋮        ⋮       ⋮       ⋮        ⋮        ⋮
 53928 │    0.79  Good       F      SI1         58.1     59.0   2756     6.06     6.13     3.54
 53929 │    0.79  Premium    E      SI2         61.4     58.0   2756     6.03     5.96     3.68
 53930 │    0.71  Ideal      G      VS1         61.4     56.0   2756     5.76     5.73     3.53
 53931 │    0.71  Premium    E      SI1         60.5     55.0   2756     5.79     5.74     3.49
 53932 │    0.71  Premium    F      SI1         59.8     62.0   2756     5.74     5.73     3.43
 53933 │    0.7   Very Good  E      VS2         60.5     59.0   2757     5.71     5.76     3.47
 53934 │    0.7   Very Good  E      VS2         61.2     59.0   2757     5.69     5.72     3.49
 53935 │    0.72  Premium    D      SI1         62.7     59.0   2757     5.69     5.73     3.58
 53936 │    0.72  Ideal      D      SI1         60.8     57.0   2757     5.75     5.76     3.5
 53937 │    0.72  Good       D      SI1         63.1     55.0   2757     5.69     5.75     3.61
 53938 │    0.7   Very Good  D      SI1         62.8     60.0   2757     5.66     5.68     3.56
 53939 │    0.86  Premium    H      SI2         61.0     58.0   2757     6.15     6.12     3.74
 53940 │    0.75  Ideal      D      SI2         62.2     55.0   2757     5.83     5.87     3.64
                                                                              53913 rows omitted

The task is to analyze the distribution of the :Cut column by :Color.

Note that these columns are categorical (as indicated by the Cat… type information above).
This allows us to check the levels of :Cut and :Color to verify their ordering:

julia> levels(diamonds.Cut)
5-element Vector{String}:
 "Fair"
 "Good"
 "Very Good"
 "Premium"
 "Ideal"

julia> levels(diamonds.Color)
7-element Vector{String}:
 "D"
 "E"
 "F"
 "G"
 "H"
 "I"
 "J"

Now we are ready to start analyzing the data.

Simple pivot table

A simple pivot table calculates the number of observations for each
:Cut and :Color combination. You can create it as follows:

julia> unstack(diamonds, :Cut, :Color, :Cut, combine=length)
5×8 DataFrame
 Row │ Cut        E       I       J       H       F       G       D
     │ Cat…       Int64?  Int64?  Int64?  Int64?  Int64?  Int64?  Int64?
─────┼───────────────────────────────────────────────────────────────────
   1 │ Ideal        3903    2093     896    3115    3826    4884    2834
   2 │ Premium      2337    1428     808    2360    2331    2924    1603
   3 │ Good          933     522     307     702     909     871     662
   4 │ Very Good    2400    1204     678    1824    2164    2299    1513
   5 │ Fair          224     175     119     303     312     314     163

Note that we put the second positional argument of unstack (:Cut in our case) in the rows,
and the third (:Color) in the columns. The fourth positional argument is what we put in the cells
of the pivot table. Since we only want the number of observations (combine=length),
it does not matter which column we pass, so I used :Cut.
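
If the repeated :Cut argument looks confusing, an equivalent (if slightly longer) spelling is to aggregate first and then reshape; here is a sketch of the same computation:

julia> counts = combine(groupby(diamonds, [:Cut, :Color]), nrow);

julia> unstack(counts, :Cut, :Color, :nrow);  # same counts as above, modulo ordering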

Fixing the order

The table looks nice, but there is one problem with it: the rows and columns are not ordered nicely.
The reason is that currently unstack in DataFrames.jl orders them by their order of appearance
in the source data frame.

We can fix it by sorting. The order of columns is set by pre-sorting the source data frame,
and the order of rows is set by post-sorting of the data frame returned by unstack.
Note that I start using @chain macro from Chain.jl for clarity of the code:

julia> @chain diamonds begin
           sort(:Color)
           unstack(:Cut, :Color, :Cut, combine=length)
           sort!(:Cut)
       end
5×8 DataFrame
 Row │ Cut        D       E       F       G       H       I       J
     │ Cat…       Int64?  Int64?  Int64?  Int64?  Int64?  Int64?  Int64?
─────┼───────────────────────────────────────────────────────────────────
   1 │ Fair          163     224     312     314     303     175     119
   2 │ Good          662     933     909     871     702     522     307
   3 │ Very Good    1513    2400    2164    2299    1824    1204     678
   4 │ Premium      1603    2337    2331    2924    2360    1428     808
   5 │ Ideal        2834    3903    3826    4884    3115    2093     896

Now the table is nicely ordered. Notice that both the sort and sort! functions are aware of
the categorical nature of the data and sort it properly.
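
To see what being aware of the categorical nature means, compare sorting a plain vector (lexicographic) with sorting a categorical one (by levels); this small example assumes CategoricalArrays.jl is available (it provides the Cat… columns we saw above):

julia> using CategoricalArrays

julia> sort(["Ideal", "Fair", "Premium"])
3-element Vector{String}:
 "Fair"
 "Ideal"
 "Premium"

julia> sort(categorical(["Ideal", "Fair", "Premium"]; levels=levels(diamonds.Cut)))
3-element CategoricalArray{String,1,UInt32}:
 "Fair"
 "Premium"
 "Ideal"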

We almost have what we wanted. The problem is that raw counts do not allow us to easily
compare the distributions across diamond colors. This can be fixed by transforming the columns
of our data frame.

Getting proportions

Let us turn the data from counts into proportions. We can do it using the transform function:

julia> @chain diamonds begin
           sort(:Color)
           unstack(:Cut, :Color, :Cut, combine=length)
           sort!(:Cut)
           transform(Not(:Cut) .=> x -> x / sum(x), renamecols=false)
       end
5×8 DataFrame
 Row │ Cut        D          E          F          G          H          I          J
     │ Cat…       Float64    Float64    Float64    Float64    Float64    Float64    Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────
   1 │ Fair       0.024059   0.0228641  0.0326975  0.0278073  0.0364884  0.0322759  0.0423789
   2 │ Good       0.0977122  0.0952332  0.095263   0.0771343  0.0845376  0.0962744  0.10933
   3 │ Very Good  0.223321   0.244973   0.226787   0.203595   0.219653   0.222058   0.241453
   4 │ Premium    0.236605   0.238542   0.244288   0.258944   0.2842     0.263371   0.287749
   5 │ Ideal      0.418303   0.398387   0.400964   0.432519   0.37512    0.38602    0.319088

Indeed, J diamonds seem to have the worst :Cut, and G diamonds the best.

Digging deeper into the data

To formally assess the ordering of the columns by :Cut quality, let us first turn each distribution
into a cumulative distribution:

julia> @chain diamonds begin
           sort(:Color)
           unstack(:Cut, :Color, :Cut, combine=length)
           sort!(:Cut)
           transform!(Not(:Cut) .=> x -> cumsum(x / sum(x)), renamecols=false)
       end
5×8 DataFrame
 Row │ Cut        D         E          F          G          H          I          J
     │ Cat…       Float64   Float64    Float64    Float64    Float64    Float64    Float64
─────┼───────────────────────────────────────────────────────────────────────────────────────
   1 │ Fair       0.024059  0.0228641  0.0326975  0.0278073  0.0364884  0.0322759  0.0423789
   2 │ Good       0.121771  0.118097   0.127961   0.104942   0.121026   0.12855    0.151709
   3 │ Very Good  0.345092  0.36307    0.354747   0.308537   0.340679   0.350609   0.393162
   4 │ Premium    0.581697  0.601613   0.599036   0.567481   0.62488    0.61398    0.680912
   5 │ Ideal      1.0       1.0        1.0        1.0        1.0        1.0        1.0

We would like to order the columns by the first-order stochastic dominance relation.
Since DataFrames.jl makes it easiest to sort rows, let us permute the dimensions of our data frame first:

julia> @chain diamonds begin
           sort(:Color)
           unstack(:Cut, :Color, :Cut, combine=length)
           sort!(:Cut)
           transform!(Not(:Cut) .=> x -> cumsum(x / sum(x)), renamecols=false)
           permutedims(:Cut, :Color)
       end
7×6 DataFrame
 Row │ Color   Fair       Good      Very Good  Premium   Ideal
     │ String  Float64    Float64   Float64    Float64   Float64
─────┼───────────────────────────────────────────────────────────
   1 │ D       0.024059   0.121771   0.345092  0.581697      1.0
   2 │ E       0.0228641  0.118097   0.36307   0.601613      1.0
   3 │ F       0.0326975  0.127961   0.354747  0.599036      1.0
   4 │ G       0.0278073  0.104942   0.308537  0.567481      1.0
   5 │ H       0.0364884  0.121026   0.340679  0.62488       1.0
   6 │ I       0.0322759  0.12855    0.350609  0.61398       1.0
   7 │ J       0.0423789  0.151709   0.393162  0.680912      1.0

Now we are ready to sort our data frame by all columns except :Color:

julia> @chain diamonds begin
           sort(:Color)
           unstack(:Cut, :Color, :Cut, combine=length)
           sort!(:Cut)
           transform!(Not(:Cut) .=> x -> cumsum(x / sum(x)), renamecols=false)
           permutedims(:Cut, :Color)
           sort!(Not(:Color))
       end
7×6 DataFrame
 Row │ Color   Fair       Good      Very Good  Premium   Ideal
     │ String  Float64    Float64   Float64    Float64   Float64
─────┼───────────────────────────────────────────────────────────
   1 │ E       0.0228641  0.118097   0.36307   0.601613      1.0
   2 │ D       0.024059   0.121771   0.345092  0.581697      1.0
   3 │ G       0.0278073  0.104942   0.308537  0.567481      1.0
   4 │ I       0.0322759  0.12855    0.350609  0.61398       1.0
   5 │ F       0.0326975  0.127961   0.354747  0.599036      1.0
   6 │ H       0.0364884  0.121026   0.340679  0.62488       1.0
   7 │ J       0.0423789  0.151709   0.393162  0.680912      1.0

The sorting exercise did not fully work this time.
First-order stochastic dominance does not single out the best option.
Note that, e.g., options E and G do not dominate each other.
However, we see that option J is clearly the worst, as it has the maximum value at every level.
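
We can verify the E-versus-G claim programmatically; the sketch below copies the CDF values from the table above (dominates is a helper name I introduce here; lower CDF values are better):

julia> dominates(a, b) = all(a .<= b) && any(a .< b)
dominates (generic function with 1 method)

julia> E = [0.0228641, 0.118097, 0.36307, 0.601613, 1.0];

julia> G = [0.0278073, 0.104942, 0.308537, 0.567481, 1.0];

julia> dominates(E, G), dominates(G, E)
(false, false)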

To see this more clearly, let us apply min-max scaling to all columns except :Color (the key column) and :Ideal (as it is constant):

julia> scale(x) = (x .- minimum(x)) / (maximum(x) - minimum(x))
scale (generic function with 1 method)

julia> @chain diamonds begin
           sort(:Color)
           unstack(:Cut, :Color, :Cut, combine=length)
           sort!(:Cut)
           transform!(Not(:Cut) .=> x -> cumsum(x / sum(x)), renamecols=false)
           permutedims(:Cut, :Color)
           sort!(Not(:Color))
           transform!(Not([:Color, :Ideal]) .=> scale .=> identity)
       end
7×6 DataFrame
 Row │ Color   Fair       Good      Very Good  Premium   Ideal
     │ String  Float64    Float64   Float64    Float64   Float64
─────┼───────────────────────────────────────────────────────────
   1 │ E       0.0        0.281301   0.644408  0.300901      1.0
   2 │ D       0.0612305  0.359855   0.431965  0.125328      1.0
   3 │ G       0.253303   0.0        0.0       0.0           1.0
   4 │ I       0.482289   0.504808   0.497151  0.409932      1.0
   5 │ F       0.503895   0.492198   0.546059  0.278184      1.0
   6 │ H       0.698153   0.343921   0.379817  0.506022      1.0
   7 │ J       1.0        1.0        1.0       1.0           1.0

Indeed, the J color has the value 1.0 in all columns.

We can also now see more clearly that if we ignore the :Fair level,
the G color dominates all other colors.

Note that in the last step I have shown an alternative to renamecols=false
for keeping the column names unchanged under a transformation:
pass the identity function as the target column name.

This works because if you pass a function as the target column name, it is applied
to the source column name (and identity keeps things as they were).
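
Here is a minimal, self-contained illustration of this trick (the toy data frame is made up):

julia> transform(DataFrame(a=1:2), :a => (x -> 2x) => identity)
2×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     2
   2 │     4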

Conclusions

I hope you found this post useful for exploring some of the
functionality of DataFrames.jl.

I also tried to show how nicely @chain can be used to gradually
build up an analysis. This is especially convenient in the Julia
REPL: when you go back in the command history, a single press of
the up arrow key brings back the whole last command (and not just
a single line, as in some other REPLs), ready to be edited.