Refreshing Mining Complex Networks book
Introduction
Two years ago together with Paweł Prałat and François Théberge we have written Mining Complex Networks book. Since it has received a positive response from the community we decided to start planning for its second edition, where we want to add material that covers recent advances in the field.
This decision prompted me to write something about graph analysis that is at the same time a nice application of the Julia language.
The post was written under Julia 1.9.0 and Graphs.jl 1.8.0.
The problem
Let me start with a business scenario. Assume you have a set of products in a store and know which of them are bought together. You represent this data as a graph where nodes are products and edges are representing co-purchase of products.
You want to find products that are not bought together, but that are fall into similar baskets. For example, assume you have two types of milk. Most likely they are not bought together, but they are both bought with e.g. bread. Such products could be called substitutes.
In the book we discuss some advanced methods that can be used for this task. In this post let me concentrate on a simple way to detect such products.
Assume that I have four products i
, j
, k
, and l
. If in the
graph we have edges i-j
, j-k
, k-l
, and l-i
, but do not have
edges i-k
and j-l
then we can say that i
and k
are substitutes
and j
and l
are substitutes (so this four nodes could be called
double-substitutes).
From a graph theory perspective the set of nodes {i, j, k, l}
form a
cycle of length 4 and they do not form any sorter cycle
(so this cycle has a hole inside it). Let us call such cycles minimal.
We can visualize this situation for example as follows:
i
/ \
l j
\ /
k
Our task for today is the following. Assume we have a random graph
with n
nodes. In this graph each edge is included with probability
p
, independently from every other edge. We want to check how
many minimal 4-cycles it contains.
Analytical solution
For a random graph it is possible to compute the expected number of minimal 4-cycles relatively easily.
Using the additivity of expectation we can concentrate on four-element
subsets of sets of nodes. For each such subset {i, j, k, l}
we have three possibilities that they can form a minimal 4-cycle:
i i i
/ \ / \ / \
l j l k k j
\ / \ / \ /
k j l
Note that the probability of observing any of them is p^4*(1-p)^2
(we see 4 edges and do not see 2 edges).
Therefore we can easily write a function that computes the expected number of such motifs as:
expected_count_empty_4cycle(n, p) = p^4*(1-p)^2 * 3 * binomial(n, 4)
Let us calculate the expected number of such motifs for n
equal to 10
and 100
and p
equal to 0.1
and 0.2
:
julia> expected_count_empty_4cycle.([10, 100], [0.1 0.2 0.3])
2×3 Matrix{Float64}:
0.05103 0.64512 2.50047
952.858 12046.0 46690.0
The function works fast and nice, but what if we made a mistake when defining it? With Julia it is easy to run a simulation checking the result.
Simulation solution, part 1
Let us start with a solution that reproduces the thinking we used to derive our analytical result:
function naive_count_empty_4cycle(g)
empty4cycle = 0
for i in 1:nv(g), j in i+1:nv(g), k in j+1:nv(g), l in k+1:nv(g)
empty4cycle += has_edge(g, i, j) && has_edge(g, j, k) &&
has_edge(g, k, l) && has_edge(g, l, i) &&
!has_edge(g, i, k) && !has_edge(g, j, l)
empty4cycle += has_edge(g, i, k) && has_edge(g, k, j) &&
has_edge(g, j, l) && has_edge(g, l, i) &&
!has_edge(g, i, j) && !has_edge(g, k, l)
empty4cycle += has_edge(g, i, j) && has_edge(g, j, l) &&
has_edge(g, l, k) && has_edge(g, k, i) &&
!has_edge(g, i, l) && !has_edge(g, j, k)
end
return empty4cycle
end
The function expects to get a Graphs.jl graph g
and traverses all
subsets of its nodes.
Let us check it:
julia> using Graphs
julia> using Random
julia> using Statistics
julia> Random.seed!(1234);
julia> [mean(naive_count_empty_4cycle(erdos_renyi(10, p)) for i in 1:1000)
for p in [0.1, 0.2, 0.3]]
3-element Vector{Float64}:
0.051
0.598
2.455
So the results look good for n=10
. But what about n=100
?
Let us check timing of a single run:
julia> @time naive_count_empty_4cycle(erdos_renyi(100, 0.1));
0.142410 seconds (274 allocations: 38.594 KiB)
We can see that the function is slow. If we wanted to run it 1000 times it would take us over 2 minutes. Let us think of a faster approach.
Simulation solution, part 2
The idea we can use is the following. Consider two nodes i
and j
and assume they
are not connected. Then it is enough to find common neighbors of i
and j
and check how many pairs of them are not connected.
Here is an example implementation of this idea:
function find_common_sorted!(common, ni, nj)
empty!(common)
iidx = 1
jidx = 1
while iidx <= length(ni) && jidx <= length(nj)
if ni[iidx] < nj[jidx]
iidx += 1
elseif ni[iidx] > nj[jidx]
jidx += 1
else
push!(common, ni[iidx])
iidx += 1
jidx += 1
end
end
nothing
end
function fast_count_empty_4cycle(g)
common = Int[]
empty4cycle = 0
for i in 1:nv(g)
ni = neighbors(g, i)
for j in i+1:nv(g)
has_edge(g, i, j) && continue
nj = neighbors(g, j)
find_common_sorted!(common, ni, nj)
for a in 1:length(common), b in a+1:length(common)
empty4cycle += !has_edge(g, common[a], common[b])
end
end
end
@assert iseven(empty4cycle)
return empty4cycle ÷ 2
end
Note that in the code we divide the number of empty cycles found by two as
assuming that nodes {i, j, k, l}
form an empty cycle we count it twice
(once starting with {i, k}
pair and once starting with {j, l}
pair).
Let us check our function:
julia> @time fast_count_empty_4cycle(erdos_renyi(100, 0.1));
0.001491 seconds (280 allocations: 40.047 KiB)
Indeed it is significantly faster. Therefore we can use it to also check the n=100
case:
julia> [mean(fast_count_empty_4cycle(erdos_renyi(10, p)) for i in 1:1000)
for p in [0.1, 0.2, 0.3]]
3-element Vector{Float64}:
0.051
0.598
2.455
julia> [mean(fast_count_empty_4cycle(erdos_renyi(100, p)) for i in 1:1000)
for p in [0.1, 0.2, 0.3]]
3-element Vector{Float64}:
954.364
12017.067
46574.827
The obtained results confirm our analytical solution, so we have some more confidence that indeed we have derived it correctly.
Conclusions
I hope you found the problem of finding the number of minimal 4-cycles interesting. Personally, really enjoyed coding it in Julia. In particular note that in Julia with our faster version of code we almost do not do any allocations:
julia> gr = erdos_renyi(1000, 0.1); @time fast_count_empty_4cycle(gr)
2.534875 seconds (4 allocations: 496 bytes)
10169170
julia> expected_count_empty_4cycle.(1000, 0.1)
1.0064361314250002e7
and the code runs quite fast.