Introduction

From January 17 to 20, 2023 my friends and I are going to give a short course on introduction to Julia for data scientists at MIT. The course is open so everyone is invited to join us. You can find the detailed schedule and location information here.

We plan to cover various areas of data science tools and techniques including predictive models, optimization, and simulation. One of the modules in the course covers mining complex networks. Interestingly, some of the techniques used there are similar to the methods I discussed last week in my Flea circus for the New Year celebration post.

Therefore today I thought to present an example related to the material we are planning to teach during the course that is covers analysis of Markov models.

The post was written under Julia 1.8.4. As last week, I want to do the whole analysis just using the functionality available in Base Julia.

The problem

The question I want to answer today is the following.

Assume we have an undirected graph and have an agent traveling on it. We assume that the graph is connected, i.e. that every node is reachable from every other node. When agent is in some node it moves to one of the nodes that is connected by an edge with the source node. If a node is an end of multiple edges then one of them is picked uniformly at random.

Now we are ready to formulate the question:

In the long run how often each node in the analyzed graph is going to be visited?

We are going to try to solve this problem numerically.

The data

In order to concentrate on some concrete graph let us analyze the GitHub social network graph, whose description can be found here.

First download the musae_git_edges.csv to your working directory.

Start with inspecting its contents:

julia> raw = collect(eachline("musae_git_edges.csv"))
289004-element Vector{String}:
 "id_1,id_2"
 "0,23977"
 "1,34526"
 "1,2370"
 "1,14683"
 "1,29982"
 "1,21142"
 ⋮
 "37519,37678"
 "19093,2347"
 "37527,37596"
 "37529,37601"
 "37644,2347"
 "25879,2347"
 "25616,2347"

We see that every line, except the first, contains information about ends of edges in our graph. Let us first parse it to numbers:

julia> edges = [parse.(Int, line) for line in split.(raw[2:end], ',')]
289003-element Vector{Vector{Int64}}:
 [0, 23977]
 [1, 34526]
 [1, 2370]
 [1, 14683]
 [1, 29982]
 [1, 21142]
 [1, 20363]
 ⋮
 [37519, 37678]
 [19093, 2347]
 [37527, 37596]
 [37529, 37601]
 [37644, 2347]
 [25879, 2347]
 [25616, 2347]

Next we check the minimum and maximum edge number in the data:

julia> extrema(Iterators.flatten(edges))
(0, 37699)

We see that we have 37700 nodes that are numbered using 0-based indexing.

Let us now create adjacency matrix am of our graph. We will put 1 in cell am[i, j] if there is an edge between node i and j. Otherwise we will leave am[i, j] to be 0. In the process of creation of am matrix, we will fix node numbers to start with 1:

julia> using SparseArrays

julia> from, to = getindex.(edges, 1) .+ 1, getindex.(edges, 2) .+ 1;

julia> am = sparse([from; to], [to; from], fill(1.0, 2 * length(edges)));

Checking if graph is connected

Let us check if the graph is connected using breadth first search (this function is a bit slow; during the course at MIT you will see how this process can be implemented faster using Graphs.jl):

julia> function allconnected(am)
           seen = falses(size(am, 1))
           to_visit = [1]
           seen[1] = true
           while !isempty(to_visit)
               i = popfirst!(to_visit)
               new_neighbors = [j for j in findnz(am[i, :])[1] if !seen[j]]
               seen[new_neighbors] .= true
               append!(to_visit, new_neighbors)
           end
           return all(seen)
       end
allconnected (generic function with 1 method)

julia> allconnected(am)
true

Indeed the graph is connected.

Creating a transition matrix

Using the am matrix it is easy to compute the transition matrix for our process. Let an entry tm[i, j] in the tm matrix be the probability that agent moves from node i to node j in one step. Clearly this probability is 1 / deg[i] where deg[i] is degree of node i (that is its number of neighbors.

Let us compute the tm matrix:

julia> deg = sum(am, dims=2);

julia> invdeg = 1 ./ deg;

julia> tm = am .* invdeg;

Let us check that indeed the probabilities in rows of tm matrix add up to 1:

julia> sum(tm, dims=2)
37700×1 Matrix{Float64}:
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 0.9999999999999999
 1.0
 1.0
 0.9999999999999998
 ⋮
 0.9999999999999974
 1.0
 1.0000000000000007
 1.0
 1.0000000000000002
 1.0
 1.0
 1.0
 1.0
 1.0

Computing stationary probability of visiting a node

We are almost done computing the long-run probability of visiting each node in the analyzed graph.

We can start with any distribution of probabilities of visiting nodes. Assume, for example, that it is uniform:

julia> p = fill(1 / 37700, 1, 37700)
1×37700 Matrix{Float64}:
 2.65252e-5  2.65252e-5  2.65252e-5  …  2.65252e-5  2.65252e-5

From stochastic matrix theory we know that p * tm is a distribution of location of the agent after one step:

julia> p * tm
1×37700 Matrix{Float64}:
 8.28912e-7  2.8377e-5  4.65354e-7  …  1.54045e-5  7.46794e-7

Now let us repeat this process many times until the vector p stabilizes:

julia> function stationary(tm; eps=sqrt(eps()))
           p = fill(1 / 37700, 1, 37700)
           while true
               newp = p * tm
               if sum(abs(o - n) for (o, n) in zip(p, newp)) < eps
                   return newp
               end
               p = newp
           end
       end
stationary (generic function with 1 method)

julia> p = stationary(tm)
1×37700 Matrix{Float64}:
 1.73009e-6  1.38407e-5  1.73009e-6  …  5.19026e-6  6.92034e-6

We know the stationary distribution of probability of visiting each node. However, it would be interesting to understand it. As we will see it is proportional to the degree of the node. Let us check it:

julia> extrema(vec(p) .- deg ./ sum(deg))
(-3.7003669919877247e-10, 9.529364854058532e-9)

Indeed, the deviation of p from the distribution given by the node degree is low as promised.

We have learned that in the long run how often each node is going to be visited with probability proportional to its degree. This property can be verified analytically to hold for any undirected connected graph. I encourage you to perform the required computations.

Conclusions

I hope you found the example interesting. During the short course at MIT this and many other examples will be explained in detail and we will discuss which Julia packages can be used to do data analysis conveniently. Still, I wanted to highlight, that a very appealing feature of Julia is related to the fact how far one can go using just its standard functionality, without having to install any packages.