Re-posted from: https://bkamins.github.io/julialang/2023/01/06/mit.html
Introduction
From January 17 to 20, 2023 my friends and I are going to give a short
course on introduction to Julia for data scientists at MIT.
The course is open so everyone is invited to join us. You can find the
detailed schedule and location information here.
We plan to cover various areas of data science tools and techniques including
predictive models, optimization, and simulation. One of the modules in the
course covers mining complex networks. Interestingly, some of the techniques
used there are similar to the methods I discussed last week in my
Flea circus for the New Year celebration post.
Therefore today I thought to present an example related to the material we are
planning to teach during the course that is covers analysis of Markov
models.
The post was written under Julia 1.8.4. As last week, I want to do the whole
analysis just using the functionality available in Base Julia.
The problem
The question I want to answer today is the following.
Assume we have an undirected graph and have an agent traveling on it. We assume
that the graph is connected, i.e. that every node is reachable from every other
node. When agent is in some node it moves to one of the nodes that is connected
by an edge with the source node. If a node is an end of multiple edges then one
of them is picked uniformly at random.
Now we are ready to formulate the question:
In the long run how often each node in the analyzed graph is going to be
visited?
We are going to try to solve this problem numerically.
The data
In order to concentrate on some concrete graph let us analyze the GitHub social
network graph, whose description can be found here.
First download the musae_git_edges.csv
to your working directory.
Start with inspecting its contents:
julia> raw = collect(eachline("musae_git_edges.csv"))
289004-element Vector{String}:
"id_1,id_2"
"0,23977"
"1,34526"
"1,2370"
"1,14683"
"1,29982"
"1,21142"
⋮
"37519,37678"
"19093,2347"
"37527,37596"
"37529,37601"
"37644,2347"
"25879,2347"
"25616,2347"
We see that every line, except the first, contains information about ends of
edges in our graph. Let us first parse it to numbers:
julia> edges = [parse.(Int, line) for line in split.(raw[2:end], ',')]
289003-element Vector{Vector{Int64}}:
[0, 23977]
[1, 34526]
[1, 2370]
[1, 14683]
[1, 29982]
[1, 21142]
[1, 20363]
⋮
[37519, 37678]
[19093, 2347]
[37527, 37596]
[37529, 37601]
[37644, 2347]
[25879, 2347]
[25616, 2347]
Next we check the minimum and maximum edge number in the data:
julia> extrema(Iterators.flatten(edges))
(0, 37699)
We see that we have 37700 nodes that are numbered using 0-based indexing.
Let us now create adjacency matrix am
of our graph.
We will put 1
in cell am[i, j]
if there is an edge between node i
and j
.
Otherwise we will leave am[i, j]
to be 0
.
In the process of creation of am
matrix, we will fix node numbers to start
with 1
:
julia> using SparseArrays
julia> from, to = getindex.(edges, 1) .+ 1, getindex.(edges, 2) .+ 1;
julia> am = sparse([from; to], [to; from], fill(1.0, 2 * length(edges)));
Checking if graph is connected
Let us check if the graph is connected using breadth first search
(this function is a bit slow; during the course at MIT you will see how this
process can be implemented faster using Graphs.jl):
julia> function allconnected(am)
seen = falses(size(am, 1))
to_visit = [1]
seen[1] = true
while !isempty(to_visit)
i = popfirst!(to_visit)
new_neighbors = [j for j in findnz(am[i, :])[1] if !seen[j]]
seen[new_neighbors] .= true
append!(to_visit, new_neighbors)
end
return all(seen)
end
allconnected (generic function with 1 method)
julia> allconnected(am)
true
Indeed the graph is connected.
Creating a transition matrix
Using the am
matrix it is easy to compute the transition matrix for our
process. Let an entry tm[i, j]
in the tm
matrix be the probability that
agent moves from node i
to node j
in one step. Clearly this probability
is 1 / deg[i]
where deg[i]
is degree of node i
(that is its number of
neighbors.
Let us compute the tm
matrix:
julia> deg = sum(am, dims=2);
julia> invdeg = 1 ./ deg;
julia> tm = am .* invdeg;
Let us check that indeed the probabilities in rows of tm
matrix add up to 1:
julia> sum(tm, dims=2)
37700×1 Matrix{Float64}:
1.0
1.0
1.0
1.0
1.0
1.0
0.9999999999999999
1.0
1.0
0.9999999999999998
⋮
0.9999999999999974
1.0
1.0000000000000007
1.0
1.0000000000000002
1.0
1.0
1.0
1.0
1.0
Computing stationary probability of visiting a node
We are almost done computing the long-run probability of visiting each node in
the analyzed graph.
We can start with any distribution of probabilities of visiting nodes. Assume,
for example, that it is uniform:
julia> p = fill(1 / 37700, 1, 37700)
1×37700 Matrix{Float64}:
2.65252e-5 2.65252e-5 2.65252e-5 … 2.65252e-5 2.65252e-5
From stochastic matrix theory we know that p * tm
is a distribution of
location of the agent after one step:
julia> p * tm
1×37700 Matrix{Float64}:
8.28912e-7 2.8377e-5 4.65354e-7 … 1.54045e-5 7.46794e-7
Now let us repeat this process many times until the vector p
stabilizes:
julia> function stationary(tm; eps=sqrt(eps()))
p = fill(1 / 37700, 1, 37700)
while true
newp = p * tm
if sum(abs(o - n) for (o, n) in zip(p, newp)) < eps
return newp
end
p = newp
end
end
stationary (generic function with 1 method)
julia> p = stationary(tm)
1×37700 Matrix{Float64}:
1.73009e-6 1.38407e-5 1.73009e-6 … 5.19026e-6 6.92034e-6
We know the stationary distribution of probability of visiting each node.
However, it would be interesting to understand it. As we will see it is
proportional to the degree of the node. Let us check it:
julia> extrema(vec(p) .- deg ./ sum(deg))
(-3.7003669919877247e-10, 9.529364854058532e-9)
Indeed, the deviation of p
from the distribution given by the node degree is
low as promised.
We have learned that in the long run how often each node is going to be
visited with probability proportional to its degree. This property can be
verified analytically to hold for any undirected connected graph. I encourage
you to perform the required computations.
Conclusions
I hope you found the example interesting. During the short course at MIT this
and many other examples will be explained in detail and we will discuss which
Julia packages can be used to do data analysis conveniently. Still, I wanted
to highlight, that a very appealing feature of Julia is related to the fact
how far one can go using just its standard functionality, without having to
install any packages.