Re-posted from: https://bkamins.github.io/julialang/2023/05/26/waw2023.html
Introduction
Today is the last day of the WAW2023 conference.
It was a great experience to attend it. I had an opportunity
to meet some of the top experts in analyzing and modeling graphs
and hypergraphs. In particular, it was really nice to listen
to several talks on tools available in the Julia language
for processing such data.
Out of the many problems discussed during the conference I picked
one that I particularly liked and would like to cover today.
The problem
The task was proposed by Lasse Leskelä from Aalto University.
The problem is formulated as follows. Assume we have two types
of hockey attackers: wingers and centers. There are 300 players
in total, out of which 100 are wingers and 200 are centers.
The players go on the ice in triplets. The probability that a given
triplet scores a goal in a single action depends on its composition
and is as follows:
- winger-winger-winger: 0.24
- winger-winger-center: 0.40
- winger-center-center: 0.32
- center-center-center: 0.36
As you can see, the best composition is two wingers and one center,
and the worst is three wingers.
Now assume that:
- you do not know who is a center and who is a winger;
- you do not know what is the fraction of centers and wingers in the data;
- you do not know the scoring probabilities.
The question is: given the information about the scoring performance
of all triplets that can be formed from these 300 players (each triplet
is allowed to play exactly one action), can you detect from this data that
there are indeed two groups of players and correctly identify them?
Why is this problem interesting?
A natural question you might ask is what the expected scoring probability
of a center and of a winger is in our data. Let us calculate it.
If I am a center then my probability of scoring is:
1/9 * 0.40 + 4/9 * 0.32 + 4/9 * 0.36 = 26/75 = 34.(6)%
The probabilities 1/9, 4/9, and 4/9 correspond to the probabilities
that a player is matched with a winger-winger, winger-center, and center-center
combination, respectively (if you are not sure about them, we will soon
check them against the data).
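As a quick arithmetic check, here is a small snippet of mine using Julia's exact rational numbers (it is not part of the code used later in the post):
julia> ((1//3)^2, 2 * (1//3) * (2//3), (2//3)^2) # winger-winger, winger-center, center-center partners
(1//9, 4//9, 4//9)
julia> (1//9) * (40//100) + (4//9) * (32//100) + (4//9) * (36//100) # a center's scoring probability
26//75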
The same calculation for a winger yields:
1/9 * 0.24 + 4/9 * 0.40 + 4/9 * 0.32 = 26/75 = 34.(6)%
As you can see, wingers and centers score with the same
probability on average in our data.
Therefore we cannot expect to be able to distinguish centers
and wingers by just looking at their average scoring.
Maybe then looking at pairs of players could help us?
Let us check the scoring probabilities of winger-winger,
winger-center, and center-center pairs. They are as follows:
For winger-winger we get:
2/3 * 0.40 + 1/3 * 0.24 = 26/75 = 34.(6)%
For winger-center:
2/3 * 0.32 + 1/3 * 0.40 = 26/75 = 34.(6)%
And for center-center:
2/3 * 0.36 + 1/3 * 0.32 = 26/75 = 34.(6)%
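These equal values can also be verified programmatically; below is a minimal sketch of mine (the names score_prob and pair_score are not used later in the post; 0 encodes a winger and 1 a center, as in the code below):
julia> score_prob = Dict(0 => 24//100, 1 => 40//100, 2 => 32//100, 3 => 36//100); # keyed by the number of centers in the triplet
julia> pair_score(ta, tb) = (1//3) * score_prob[ta + tb] + (2//3) * score_prob[ta + tb + 1]; # third player: winger w.p. 1/3, center w.p. 2/3
julia> pair_score(0, 0), pair_score(0, 1), pair_score(1, 1)
(26//75, 26//75, 26//75)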
Unfortunately, looking at pairs of players still would not give
us enough insight to solve the problem. It turns out that
we need to take three-way interactions between players
into account to have any hope of solving it.
From a graph perspective we can look at this problem as follows.
Players are nodes, and each triplet that scored forms a three-element
hyperedge. What we have learned above is that to solve
the problem we need to use a method that is hyperedge-aware.
Preparing the data
To set the stage, start by generating sample data for our problem.
In the code I encode wingers with 0 and centers with 1.
Therefore, if we take three players, the sum of their codes ranges
from 0 to 3 and encodes the composition of the attacking team.
First load the packages:
julia> using Random
julia> using Graphs
julia> using DataFrames
julia> using Plots
julia> using Chain
julia> using Statistics
julia> import MultivariateStats
Next define a function result that maps a triplet of player types to a boolean
value telling us if a given set of players scored (with the probabilities
I described above):
julia> prob(t1, t2, t3) = (0.24, 0.40, 0.32, 0.36)[t1 + t2 + t3 + 1]
prob (generic function with 1 method)
julia> result(t1, t2, t3) = rand() < prob(t1, t2, t3)
result (generic function with 1 method)
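For example, a triplet with one winger and two centers (sum of codes equal to 2) maps to the third entry of the probability tuple:
julia> prob(1, 0, 1) # one winger and two centers
0.32
julia> prob(0, 0, 0) # three wingers
0.24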
Now we are ready to generate players. Start by randomly assigning
each player the winger or center type (the counts follow the problem statement):
julia> n_winger, n_center = 100, 200; # 100 wingers and 200 centers, as stated in the problem
julia> ids = shuffle!([fill(0, n_winger); fill(1, n_center)])
300-element Vector{Int64}:
1
1
0
⋮
1
0
1
We are now ready to generate the table holding all player triplets.
Note that I want to ensure that each triplet is stored in it only
once (up to permutation), so I filter the table to keep only rows
with increasing player numbers:
julia> df = allcombinations(DataFrame, x1=1:300, x2=1:300, x3=1:300)
27000000×3 DataFrame
Row │ x1 x2 x3
│ Int64 Int64 Int64
──────────┼─────────────────────
1 │ 1 1 1
2 │ 2 1 1
3 │ 3 1 1
4 │ 4 1 1
⋮ │ ⋮ ⋮ ⋮
26999997 │ 297 300 300
26999998 │ 298 300 300
26999999 │ 299 300 300
27000000 │ 300 300 300
26999992 rows omitted
julia> subset!(df, [[:x1, :x2], [:x2, :x3]] .=> ByRow(<))
4455100×3 DataFrame
Row │ x1 x2 x3
│ Int64 Int64 Int64
─────────┼─────────────────────
1 │ 1 2 3
2 │ 1 2 4
3 │ 1 3 4
4 │ 2 3 4
⋮ │ ⋮ ⋮ ⋮
4455097 │ 295 299 300
4455098 │ 296 299 300
4455099 │ 297 299 300
4455100 │ 298 299 300
4455092 rows omitted
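As a side note, an equivalent way to obtain the same increasing triplets is to enumerate them directly; here is a minimal sketch, assuming the Combinatorics.jl package is available (it is not used elsewhere in this post, and triplets and df2 are names of my own):
julia> using Combinatorics
julia> triplets = collect(combinations(1:300, 3)); # all increasing triplets of players
julia> df2 = DataFrame(x1 = getindex.(triplets, 1), x2 = getindex.(triplets, 2), x3 = getindex.(triplets, 3)); # same set of triplets as df, possibly in a different row order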
Now for each triplet we add the player type information
and then call the result function to randomly decide
if a given triplet scored or not:
julia> transform!(df, All() .=> (x -> ids[x]) .=> [:t1, :t2, :t3])
4455100×6 DataFrame
Row │ x1 x2 x3 t1 t2 t3
│ Int64 Int64 Int64 Int64 Int64 Int64
─────────┼──────────────────────────────────────────
1 │ 1 2 3 1 1 0
2 │ 1 2 4 1 1 1
3 │ 1 3 4 1 0 1
4 │ 2 3 4 1 0 1
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
4455097 │ 295 299 300 0 0 1
4455098 │ 296 299 300 0 0 1
4455099 │ 297 299 300 0 0 1
4455100 │ 298 299 300 1 0 1
4455092 rows omitted
julia> transform!(df, r"t" => ByRow(result) => :result)
4455100×7 DataFrame
Row │ x1 x2 x3 t1 t2 t3 result
│ Int64 Int64 Int64 Int64 Int64 Int64 Bool
─────────┼──────────────────────────────────────────────────
1 │ 1 2 3 1 1 0 true
2 │ 1 2 4 1 1 1 false
3 │ 1 3 4 1 0 1 true
4 │ 2 3 4 1 0 1 true
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
4455097 │ 295 299 300 0 0 1 true
4455098 │ 296 299 300 0 0 1 true
4455099 │ 297 299 300 0 0 1 true
4455100 │ 298 299 300 1 0 1 false
4455092 rows omitted
We have our data ready. Let us first check if indeed the
theoretical computations we did above hold.
Initial screening of data
Start by computing the mean scoring probability by player
and the number of times each player was present in
the data (and sort the data frame by mean result
to see if there are any large differences in it):
julia> @chain df begin
[select(_, "x$i" => :x, :result) for i in 1:3]
vcat(_...)
groupby(:x)
combine(:result => mean, nrow)
sort!(:result_mean)
end
300×3 DataFrame
Row │ x result_mean nrow
│ Int64 Float64 Int64
─────┼───────────────────────────
1 │ 106 0.341294 44551
2 │ 121 0.341429 44551
3 │ 10 0.341878 44551
4 │ 46 0.342529 44551
5 │ 71 0.342596 44551
6 │ 2 0.342663 44551
7 │ 281 0.342776 44551
8 │ 60 0.342865 44551
9 │ 233 0.342888 44551
10 │ 73 0.343045 44551
11 │ 266 0.34318 44551
12 │ 131 0.343202 44551
13 │ 40 0.343225 44551
14 │ 138 0.343292 44551
⋮ │ ⋮ ⋮ ⋮
288 │ 176 0.350205 44551
289 │ 293 0.350205 44551
290 │ 19 0.35034 44551
291 │ 48 0.350407 44551
292 │ 155 0.350632 44551
293 │ 195 0.350744 44551
294 │ 132 0.350879 44551
295 │ 96 0.351103 44551
296 │ 110 0.35144 44551
297 │ 172 0.351687 44551
298 │ 254 0.351687 44551
299 │ 231 0.352383 44551
300 │ 163 0.354964 44551
273 rows omitted
Indeed we see that all values are close to 34.(6)% and
that each player appears in 299*298/2 rows of our table,
as expected.
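The expected per-player row count can be double-checked directly:
julia> binomial(299, 2) # number of distinct partner pairs for a fixed player
44551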
The second check is the mean scoring probability by player type
and the fraction of each player type:
julia> @chain df begin
[select(_, "t$i" => :t, :result) for i in 1:3]
vcat(_...)
groupby(:t)
combine(:result => mean, proprow)
end
2×3 DataFrame
Row │ t result_mean proprow
│ Int64 Float64 Float64
─────┼──────────────────────────────
1 │ 0 0.346636 0.333333
2 │ 1 0.346613 0.666667
Again the results are consistent with what we should get.
Let us now turn to checking the scoring probability and the distribution
by pairs of player types:
julia> @chain df begin
[select(_, ["t$i", "t$(mod1(i+1, 3))"] =>
ByRow(minmax) =>
[:ta, :tb], :result) for i in 1:3]
vcat(_...)
groupby([:ta, :tb])
combine(:result => mean, proprow)
end
3×4 DataFrame
Row │ ta tb result_mean proprow
│ Int64 Int64 Float64 Float64
─────┼─────────────────────────────────────
1 │ 0 0 0.346844 0.110368
2 │ 0 1 0.346532 0.445931
3 │ 1 1 0.346653 0.443701
All looks good again. The mean is constant and we get the 1:4:4 proportion
of the different pairs of player types.
Finally we check the three way combinations:
julia> combine(groupby(df, r"t\d"), :result => mean, proprow)
8×5 DataFrame
Row │ t1 t2 t3 result_mean proprow
│ Int64 Int64 Int64 Float64 Float64
─────┼─────────────────────────────────────────────
1 │ 0 0 0 0.238695 0.0362955
2 │ 0 0 1 0.398932 0.0821452
3 │ 0 1 0 0.400749 0.076282
4 │ 0 1 1 0.319947 0.169792
5 │ 1 0 0 0.399914 0.06379
6 │ 1 0 1 0.320193 0.14399
7 │ 1 1 0 0.319905 0.132896
8 │ 1 1 1 0.360107 0.294808
Now we indeed see the expected differences in means.
Also the computed probabilities of seeing a given combination
of values are close to expectation, e.g. for the 0-0-0 combination
we expected (1/3)^3 ≈ 3.7%
and for the 1-1-1 combination we expected (2/3)^3 ≈ 29.6%
of observations.
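These two reference shares can be computed exactly:
julia> (1//3)^3, (2//3)^3 # expected shares of the 0-0-0 and 1-1-1 combinations
(1//27, 8//27)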
Solving the puzzle
From the analysis above we know that we need to take three-way
relationships into account to distinguish centers from wingers.
Let me remind you that we assumed that:
- you do not know who is a center and who is a winger;
- you do not know what is the fraction of centers and wingers in the data;
- you do not know the scoring probabilities.
Therefore we are in an unsupervised learning setting. What we are looking
for is some representation of this data under which wingers and centers
would form separate clusters. We cannot hope to say which players are centers
and which are wingers, but maybe we can visually separate them.
The idea is to represent our data as a matrix (designed in a way that
captures three-way relationships between players) and then project
it down to one dimension (and hope to see some nice pattern).
How should we define the matrix? Let us try the following. I call the matrix
inc2. It has 300 rows and 300*299 columns.
Rows represent node numbers. Columns represent node pairs
(and we potentially have 300*299 such pairs).
The player pair (a, b) is assigned to column number j = 300*(a-1) + b.
I put 1 in cell inc2[i, j] if the unordered triplet {i, a, b}
scored a goal. Otherwise I put 0 in the matrix.
Note that this matrix is sparse (some of its columns are all zeros), but my
data is small enough that I do not need to optimize the code to take this into account.
Here is the code generating the matrix:
julia> inc2 = zeros(Bool, 300, 300*299)
300×89700 Matrix{Bool}:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
⋮ ⋮ ⋮ ⋮ ⋱ ⋮ ⋮ ⋮
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
julia> for row in eachrow(subset(df, :result))
           inc2[row.x1, 300*(row.x2-1) + row.x3] = true
           inc2[row.x2, 300*(row.x1-1) + row.x3] = true
           inc2[row.x3, 300*(row.x1-1) + row.x2] = true
       end
julia> inc2
300×89700 Matrix{Bool}:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 1 0 0 1 0 1 0 0 1 0 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 1 0 0 1 0 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
⋮ ⋮ ⋮ ⋮ ⋱ ⋮ ⋮ ⋮
0 0 1 0 0 1 1 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 1 … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 1 0 1 0 0 0 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 1 1 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I called the matrix inc2 as it represents the incidence between nodes and pairs of nodes.
In the for loop above I took advantage of the fact that I generated my data so that x1 < x2 < x3.
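As a side note, for larger data the same incidence information could be stored sparsely. Below is a minimal sketch of mine, assuming the SparseArrays standard library (the names scored, rows, cols, and inc2s are not used elsewhere in the post):
julia> using SparseArrays
julia> scored = subset(df, :result); # only the triplets that scored
julia> rows = [scored.x1; scored.x2; scored.x3];
julia> cols = [300 .* (scored.x2 .- 1) .+ scored.x3;
               300 .* (scored.x1 .- 1) .+ scored.x3;
               300 .* (scored.x1 .- 1) .+ scored.x2];
julia> inc2s = sparse(rows, cols, fill(true, length(rows)), 300, 300*299);
julia> inc2s == inc2 # same content as inc2, only the storage format differs
true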
Let us quickly check if indeed each node has approximately the same number of 1 entries:
julia> sum(inc2, dims=2)
300×1 Matrix{Int64}:
15340
15266
15362
15402
15523
15324
⋮
15324
15464
15367
15543
15470
15495
Indeed we see that all row sums are roughly equal to 44551 * 34.(6)%,
as expected.
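For reference, the expected row sum is:
julia> round(Int, 44551 * 26 / 75) # expected number of 1 entries per player
15444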
Now we have an 89700-dimensional representation of each node. We want to compress it to one dimension.
Let us pick classic PCA. In MultivariateStats.jl observations are stored in columns,
so I work with the transpose of the inc2 matrix:
julia> p = MultivariateStats.fit(MultivariateStats.PCA, inc2', maxoutdim=1);
julia> t = MultivariateStats.transform(p, inc2')
1×300 Matrix{Float64}:
-7.38127 -7.47636 14.2659 -5.88158 13.5279 -6.48501 13.0815 … 13.9629 14.2848 13.4581 -7.5586 14.5669 -7.71527
Now for each observation (a column of the t matrix) we have its 1-dimensional representation.
Let us visually check if these components have some structure. Start with a histogram:
julia> histogram(vec(t); nbins=50, legend=false)
You can see the produced plot below:
It looks promising. We got two well-separated clusters. But do they correspond to our ids vector?
julia> histogram(t[ids .== 1]; nbins=20, label="1")
julia> histogram!(t[ids .== 0]; nbins=10, label="0")
You can see the produced plot below:
It seems we indeed got a perfect separation. Let us double-check it:
julia> describe(t[ids .== 1])
Summary Stats:
Length: 200
Missing Count: 0
Mean: -6.914677
Minimum: -9.086461
1st Quartile: -7.496597
Median: -6.926006
3rd Quartile: -6.359373
Maximum: -4.606485
Type: Float64
julia> describe(t[ids .== 0])
Summary Stats:
Length: 100
Missing Count: 0
Mean: 13.829354
Minimum: 12.317402
1st Quartile: 13.322935
Median: 13.911846
3rd Quartile: 14.269240
Maximum: 15.791385
Type: Float64
The distributions confirm that we recovered the communities correctly.
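As an extra check of my own (not part of the original analysis), we can compare a simple sign-based split of the first principal component with the true labels. Note that PCA components are defined only up to sign, so which group ends up on the positive side may differ between runs:
julia> pred = vec(t) .> 0; # split players by the sign of their component
julia> pred == (ids .== 0) # in this run the positive cluster is exactly the wingers
true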
Conclusions
The presented approach is one of many that could be used to solve the puzzle.
I encourage you to think of alternative approaches.
However, I think that the puzzle nicely shows that sometimes slightly
reformulating a problem allows you to use standard data science tools to solve it.
I hope you enjoyed it!