Re-posted from: https://bkamins.github.io/julialang/2023/05/26/waw2023.html
Introduction
Today is the last day of the WAW2023 conference.
It was a great experience to attend it. I had an opportunity
to meet some of the top experts in analyzing and modeling graphs
and hypergraphs. In particular, it was really nice to listen
to several talks on tools available in the Julia language
for processing such data.
Out of the many problems discussed during the conference I picked
one that I particularly liked and would like to cover today.
The problem
The task was proposed by Lasse Leskelä from Aalto University.
The problem is formulated as follows. Assume we have two types
of hockey attackers: wingers and centers. There are 300 players
in total, out of which 100 are wingers and 200 are centers.
The players go on the ice in triplets. The probability that a given
triplet scores a goal in a single action depends on its composition
and is as follows:
- winger-winger-winger: 0.24
- winger-winger-center: 0.40
- winger-center-center: 0.32
- center-center-center: 0.36
As you can see, the best composition is two wingers and one center,
and the worst is three wingers.
Now assume that:
- you do not know who is a center and who is a winger;
- you do not know what is the fraction of centers and wingers in the data;
- you do not know the scoring probabilities.
The question is: given the information about the scoring performance
of all triplets that can be formed from these 300 players (each triplet
is allowed to play exactly one action), can you detect from this data that
there are indeed two groups of players and correctly identify them?
Why is this problem interesting?
A natural question you might ask is what the expected scoring probability
of a center and of a winger is in our data. Let us calculate it.
If I am a center then my probability of scoring is:
1/9 * 0.40 + 4/9 * 0.32 + 4/9 * 0.36 = 26/75 = 34.(6)%
The probabilities 1/9, 4/9, and 4/9 correspond to the probabilities
that a player is matched with a winger-winger, winger-center, and center-center
combination, respectively (if you are not sure about them, we will soon
check them against the data).
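As a quick arithmetic check, here is a small snippet of mine using Julia's exact rational numbers (it is not part of the code used later in the post):
julia> ((1//3)^2, 2 * (1//3) * (2//3), (2//3)^2) # winger-winger, winger-center, center-center partners
(1//9, 4//9, 4//9)
julia> (1//9) * (40//100) + (4//9) * (32//100) + (4//9) * (36//100) # a center's scoring probability
26//75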
The same calculation for a winger yields:
1/9 * 0.24 + 4/9 * 0.40 + 4/9 * 0.32 = 26/75 = 34.(6)%
As you can see, wingers and centers score with the same
probability on average in our data.
Therefore we cannot expect to be able to distinguish centers
and wingers by just looking at their average scoring.
Maybe then looking at pairs of players could help us?
Let us check the scoring probabilities of winger-winger,
winger-center, and center-center pairs. They are as follows:
For winger-winger we get:
2/3 * 0.40 + 1/3 * 0.24 = 26/75 = 34.(6)%
For winger-center:
2/3 * 0.32 + 1/3 * 0.40 = 26/75 = 34.(6)%
And for center-center:
2/3 * 0.36 + 1/3 * 0.32 = 26/75 = 34.(6)%
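These equal values can also be verified programmatically; below is a minimal sketch of mine (the names score_prob and pair_score are not used later in the post; 0 encodes a winger and 1 a center, as in the code below):
julia> score_prob = Dict(0 => 24//100, 1 => 40//100, 2 => 32//100, 3 => 36//100); # keyed by the number of centers in the triplet
julia> pair_score(ta, tb) = (1//3) * score_prob[ta + tb] + (2//3) * score_prob[ta + tb + 1]; # third player: winger w.p. 1/3, center w.p. 2/3
julia> pair_score(0, 0), pair_score(0, 1), pair_score(1, 1)
(26//75, 26//75, 26//75)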
Unfortunately, looking at pairs of players still would not give
us enough insight to solve the problem. It turns out that
we need to take three-way interactions between players
into account to have any hope of solving it.
From a graph perspective we can look at this problem as follows.
Players are nodes, and each triplet that scored forms a three-element
hyperedge. What we have learned above is that to solve
the problem we need to use a method that is hyperedge-aware.
Preparing the data
To set the stage, start by generating sample data for our problem.
In the code I encode wingers with 0 and centers with 1.
Therefore, if we take three players, the sum of their codes ranges
from 0 to 3 and encodes the composition of the attacking team.
First load the packages:
julia> using Random
julia> using Graphs
julia> using DataFrames
julia> using Plots
julia> using Chain
julia> using Statistics
julia> import MultivariateStats
Next define a function result that maps a triplet of player types to a boolean
value telling us if a given set of players scored (with the probabilities
I described above):
julia> prob(t1, t2, t3) = (0.24, 0.40, 0.32, 0.36)[t1 + t2 + t3 + 1]
prob (generic function with 1 method)
julia> result(t1, t2, t3) = rand() < prob(t1, t2, t3)
result (generic function with 1 method)
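For example, a triplet with one winger and two centers (sum of codes equal to 2) maps to the third entry of the probability tuple:
julia> prob(1, 0, 1) # one winger and two centers
0.32
julia> prob(0, 0, 0) # three wingers
0.24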
Now we are ready to generate players. Start by randomly assigning
each player the winger or center type (the counts follow the problem statement):
julia> n_winger, n_center = 100, 200; # 100 wingers and 200 centers, as stated in the problem
julia> ids = shuffle!([fill(0, n_winger); fill(1, n_center)])
300-element Vector{Int64}:
1
1
0
⋮
1
0
1
We are now ready to generate the table holding all player triplets.
Note that I want to ensure that each triplet is stored in it only
once (up to permutation), so I filter the table to keep only rows
with increasing player numbers:
julia> df = allcombinations(DataFrame, x1=1:300, x2=1:300, x3=1:300)
27000000×3 DataFrame
Row │ x1 x2 x3
│ Int64 Int64 Int64
──────────┼─────────────────────
1 │ 1 1 1
2 │ 2 1 1
3 │ 3 1 1
4 │ 4 1 1
⋮ │ ⋮ ⋮ ⋮
26999997 │ 297 300 300
26999998 │ 298 300 300
26999999 │ 299 300 300
27000000 │ 300 300 300
26999992 rows omitted
julia> subset!(df, [[:x1, :x2], [:x2, :x3]] .=> ByRow(<))
4455100×3 DataFrame
Row │ x1 x2 x3
│ Int64 Int64 Int64
─────────┼─────────────────────
1 │ 1 2 3
2 │ 1 2 4
3 │ 1 3 4
4 │ 2 3 4
⋮ │ ⋮ ⋮ ⋮
4455097 │ 295 299 300
4455098 │ 296 299 300
4455099 │ 297 299 300
4455100 │ 298 299 300
4455092 rows omitted
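As a side note, an equivalent way to obtain the same increasing triplets is to enumerate them directly; here is a minimal sketch, assuming the Combinatorics.jl package is available (it is not used elsewhere in this post, and triplets and df2 are names of my own):
julia> using Combinatorics
julia> triplets = collect(combinations(1:300, 3)); # all increasing triplets of players
julia> df2 = DataFrame(x1 = getindex.(triplets, 1), x2 = getindex.(triplets, 2), x3 = getindex.(triplets, 3)); # same set of triplets as df, possibly in a different row order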
Now for each triplet we add the player type information
and then call the result function to randomly decide
if a given triplet scored or not:
julia> transform!(df, All() .=> (x -> ids[x]) .=> [:t1, :t2, :t3])
4455100×6 DataFrame
Row │ x1 x2 x3 t1 t2 t3
│ Int64 Int64 Int64 Int64 Int64 Int64
─────────┼──────────────────────────────────────────
1 │ 1 2 3 1 1 0
2 │ 1 2 4 1 1 1
3 │ 1 3 4 1 0 1
4 │ 2 3 4 1 0 1
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
4455097 │ 295 299 300 0 0 1
4455098 │ 296 299 300 0 0 1
4455099 │ 297 299 300 0 0 1
4455100 │ 298 299 300 1 0 1
4455092 rows omitted
julia> transform!(df, r"t" => ByRow(result) => :result)
4455100×7 DataFrame
Row │ x1 x2 x3 t1 t2 t3 result
│ Int64 Int64 Int64 Int64 Int64 Int64 Bool
─────────┼──────────────────────────────────────────────────
1 │ 1 2 3 1 1 0 true
2 │ 1 2 4 1 1 1 false
3 │ 1 3 4 1 0 1 true
4 │ 2 3 4 1 0 1 true
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
4455097 │ 295 299 300 0 0 1 true
4455098 │ 296 299 300 0 0 1 true
4455099 │ 297 299 300 0 0 1 true
4455100 │ 298 299 300 1 0 1 false
4455092 rows omitted
We have our data ready. Let us first check if indeed the
theoretical computations we did above hold.
Initial screening of data
Start by computing the mean scoring probability by player
and the number of times each player was present in
the data (and sort the data frame by mean result
to see if there are any large differences in it):
julia> @chain df begin
[select(_, "x$i" => :x, :result) for i in 1:3]
vcat(_...)
groupby(:x)
combine(:result => mean, nrow)
sort!(:result_mean)
end
300×3 DataFrame
Row │ x result_mean nrow
│ Int64 Float64 Int64
─────┼───────────────────────────
1 │ 106 0.341294 44551
2 │ 121 0.341429 44551
3 │ 10 0.341878 44551
4 │ 46 0.342529 44551
5 │ 71 0.342596 44551
6 │ 2 0.342663 44551
7 │ 281 0.342776 44551
8 │ 60 0.342865 44551
9 │ 233 0.342888 44551
10 │ 73 0.343045 44551
11 │ 266 0.34318 44551
12 │ 131 0.343202 44551
13 │ 40 0.343225 44551
14 │ 138 0.343292 44551
⋮ │ ⋮ ⋮ ⋮
288 │ 176 0.350205 44551
289 │ 293 0.350205 44551
290 │ 19 0.35034 44551
291 │ 48 0.350407 44551
292 │ 155 0.350632 44551
293 │ 195 0.350744 44551
294 │ 132 0.350879 44551
295 │ 96 0.351103 44551
296 │ 110 0.35144 44551
297 │ 172 0.351687 44551
298 │ 254 0.351687 44551
299 │ 231 0.352383 44551
300 │ 163 0.354964 44551
273 rows omitted
Indeed we see that all values are close to 34.(6)% and
that each player appears in 299*298/2 rows of our table,
as expected.
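The expected per-player row count can be double-checked directly:
julia> binomial(299, 2) # number of distinct partner pairs for a fixed player
44551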
The second check is the mean scoring probability by player type
and the fraction of each player type:
julia> @chain df begin
[select(_, "t$i" => :t, :result) for i in 1:3]
vcat(_...)
groupby(:t)
combine(:result => mean, proprow)
end
2×3 DataFrame
Row │ t result_mean proprow
│ Int64 Float64 Float64
─────┼──────────────────────────────
1 │ 0 0.346636 0.333333
2 │ 1 0.346613 0.666667
Again the results are consistent with what we should get.
Let us now turn to checking the scoring probability and the distribution
by pairs of player types:
julia> @chain df begin
[select(_, ["t$i", "t$(mod1(i+1, 3))"] =>
ByRow(minmax) =>
[:ta, :tb], :result) for i in 1:3]
vcat(_...)
groupby([:ta, :tb])
combine(:result => mean, proprow)
end
3×4 DataFrame
Row │ ta tb result_mean proprow
│ Int64 Int64 Float64 Float64
─────┼─────────────────────────────────────
1 │ 0 0 0.346844 0.110368
2 │ 0 1 0.346532 0.445931
3 │ 1 1 0.346653 0.443701
All looks good again. The mean is constant and we get the 1:4:4 proportion
of the different pairs of player types.
Finally we check the three way combinations:
julia> combine(groupby(df, r"t\d"), :result => mean, proprow)
8×5 DataFrame
Row │ t1 t2 t3 result_mean proprow
│ Int64 Int64 Int64 Float64 Float64
─────┼─────────────────────────────────────────────
1 │ 0 0 0 0.238695 0.0362955
2 │ 0 0 1 0.398932 0.0821452
3 │ 0 1 0 0.400749 0.076282
4 │ 0 1 1 0.319947 0.169792
5 │ 1 0 0 0.399914 0.06379
6 │ 1 0 1 0.320193 0.14399
7 │ 1 1 0 0.319905 0.132896
8 │ 1 1 1 0.360107 0.294808
Now we indeed see the expected differences in means.
Also the computed probabilities of seeing a given combination
of values are close to expectation, e.g. for the 0-0-0 combination
we expected (1/3)^3 ≈ 3.7%
and for the 1-1-1 combination we expected (2/3)^3 ≈ 29.6%
of observations.
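These two reference shares can be computed exactly:
julia> (1//3)^3, (2//3)^3 # expected shares of the 0-0-0 and 1-1-1 combinations
(1//27, 8//27)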
Solving the puzzle
From the analysis above we know that we need to take three-way
relationships into account to distinguish centers from wingers.
Let me remind you that we assumed that:
- you do not know who is a center and who is a winger;
- you do not know what is the fraction of centers and wingers in the data;
- you do not know the scoring probabilities.
Therefore we are in an unsupervised learning setting. What we are looking
for is some representation of this data under which wingers and centers
would form separate clusters. We cannot hope to say which players are centers
and which are wingers, but maybe we can visually separate them.
The idea is to represent our data as a matrix (designed in a way that
captures three-way relationships between players) and then project
it down to one dimension (and hope to see some nice pattern).
How should we define the matrix? Let us try the following. I call the matrix
inc2. It has 300 rows and 300*299 columns.
Rows represent node numbers. Columns represent node pairs
(and we potentially have 300*299 such pairs).
The player pair (a, b) is assigned to column number j = 300*(a-1) + b.
I put 1 in cell inc2[i, j] if the unordered triplet {i, a, b}
scored a goal. Otherwise I put 0 in the matrix.
Note that this matrix is sparse (some of its columns are all zeros), but my
data is small enough that I do not need to optimize the code to take this into account.
Here is the code generating the matrix:
julia> inc2 = zeros(Bool, 300, 300*299)
300×89700 Matrix{Bool}:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
⋮ ⋮ ⋮ ⋮ ⋱ ⋮ ⋮ ⋮
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
julia> for row in eachrow(subset(df, :result))
           inc2[row.x1, 300*(row.x2-1) + row.x3] = true
           inc2[row.x2, 300*(row.x1-1) + row.x3] = true
           inc2[row.x3, 300*(row.x1-1) + row.x2] = true
       end
julia> inc2
300×89700 Matrix{Bool}:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 1 0 0 1 0 1 0 0 1 0 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 1 0 0 1 0 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
⋮ ⋮ ⋮ ⋮ ⋱ ⋮ ⋮ ⋮
0 0 1 0 0 1 1 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 1 … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 1 0 1 0 0 0 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 1 1 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I called the matrix inc2 as it represents the incidence between nodes and pairs of nodes.
In the for loop above I took advantage of the fact that I generated my data so that x1 < x2 < x3.
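As a side note, for larger data the same incidence information could be stored sparsely. Below is a minimal sketch of mine, assuming the SparseArrays standard library (the names scored, rows, cols, and inc2s are not used elsewhere in the post):
julia> using SparseArrays
julia> scored = subset(df, :result); # only the triplets that scored
julia> rows = [scored.x1; scored.x2; scored.x3];
julia> cols = [300 .* (scored.x2 .- 1) .+ scored.x3;
               300 .* (scored.x1 .- 1) .+ scored.x3;
               300 .* (scored.x1 .- 1) .+ scored.x2];
julia> inc2s = sparse(rows, cols, fill(true, length(rows)), 300, 300*299);
julia> inc2s == inc2 # same content as inc2, only the storage format differs
true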
Let us quickly check if indeed each node has approximately the same number of 1 entries:
julia> sum(inc2, dims=2)
300×1 Matrix{Int64}:
15340
15266
15362
15402
15523
15324
⋮
15324
15464
15367
15543
15470
15495
Indeed we see that all row sums are roughly equal to 44551 * 34.(6)%,
as expected.
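For reference, the expected row sum is:
julia> round(Int, 44551 * 26 / 75) # expected number of 1 entries per player
15444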
Now we have an 89700-dimensional representation of each node. We want to compress it to one dimension.
Let us pick classic PCA. In MultivariateStats.jl observations are stored in columns,
so I work with the transpose of the inc2 matrix:
julia> p = MultivariateStats.fit(MultivariateStats.PCA, inc2', maxoutdim=1);
julia> t = MultivariateStats.transform(p, inc2')
1×300 Matrix{Float64}:
-7.38127 -7.47636 14.2659 -5.88158 13.5279 -6.48501 13.0815 … 13.9629 14.2848 13.4581 -7.5586 14.5669 -7.71527
Now for each observation (a column of the t matrix) we have its 1-dimensional representation.
Let us visually check if these components have some structure. Start with a histogram:
julia> histogram(vec(t); nbins=50, legend=false)
You can see the produced plot below:
It looks promising. We got two well-separated clusters. But do they correspond to our ids vector?
julia> histogram(t[ids .== 1]; nbins=20, label="1")
julia> histogram!(t[ids .== 0]; nbins=10, label="0")
You can see the produced plot below:
It seems we indeed got a perfect separation. Let us double-check it:
julia> describe(t[ids .== 1])
Summary Stats:
Length: 200
Missing Count: 0
Mean: -6.914677
Minimum: -9.086461
1st Quartile: -7.496597
Median: -6.926006
3rd Quartile: -6.359373
Maximum: -4.606485
Type: Float64
julia> describe(t[ids .== 0])
Summary Stats:
Length: 100
Missing Count: 0
Mean: 13.829354
Minimum: 12.317402
1st Quartile: 13.322935
Median: 13.911846
3rd Quartile: 14.269240
Maximum: 15.791385
Type: Float64
The distributions confirm that we recovered the communities correctly.
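As an extra check of my own (not part of the original analysis), we can compare a simple sign-based split of the first principal component with the true labels. Note that PCA components are defined only up to sign, so which group ends up on the positive side may differ between runs:
julia> pred = vec(t) .> 0; # split players by the sign of their component
julia> pred == (ids .== 0) # in this run the positive cluster is exactly the wingers
true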
Conclusions
The presented approach is one of many that could be used to solve the puzzle.
I encourage you to think of alternative approaches.
However, I think that the puzzle nicely shows that sometimes slightly
reformulating a problem allows you to use standard data science tools to solve it.
I hope you enjoyed it!