This is my last blog post with the previews of an upcoming Julia 1.11 release. The functionality I want to cover today is an option of defining an entry point to the Julia script.

The code was tested under Julia 1.11 RC1.

Traditionally, when writing a Julia script you assume that it is run with the `julia some_script.jl` command. In this case Julia sequentially executes the contents of the `some_script.jl` file and terminates.

When I wrote Julia code that was meant to be executed in this way, my typical approach was to encapsulate all executed code in functions. This avoids many problems introduced by code executed in global scope, including these common issues:

- scope of variables (no need to think about the `global` keyword);
- performance (code inside functions is compiled, thus fast);
- accidental use of the same name for different objects in global scope (I think everyone has been bitten by this issue);
- pollution of RAM (large objects that have bindings in global scope are kept alive, and it is easy to forget to unbind them to allow garbage collection).
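To illustrate the performance point, here is a minimal sketch (my addition, not from the original text) comparing the same loop written in global scope and inside a function:

```julia
# summing in global scope: `s` is an untyped global, so every
# iteration boxes the value and the loop is slow
s = 0
for i in 1:1_000_000
    global s += i
end

# the same loop inside a function compiles to tight native code
function sumto(n)
    s = 0
    for i in 1:n
        s += i
    end
    return s
end

sumto(1_000_000) == s  # true; same result, but much faster
```

Benchmarking both versions (e.g. with BenchmarkTools.jl) typically shows an order-of-magnitude difference in favor of the function.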

Therefore a typical structure of my code was:

```
# ... definitions of data structures and code inside functions ...

function main(ARGS)
    # ... the operations I want to have executed by the script ...
end

main(ARGS)
```

This is a style that is natural for programmers used to languages such as C, where the `main` function is an entry point.

Julia 1.11 adds an option to mark the `main` function as an entry point. It makes sure that `main(ARGS)` gets called after execution of the script.

It is quite easy to mark the `main` function as an entry point. It is enough to replace `main(ARGS)` with `(@main)(ARGS)` in my example above. Thus, starting from Julia 1.11, I can write my scripts as:

```
# ... definitions of data structures and code inside functions ...

function (@main)(ARGS)
    # ... the operations I want to have executed by the script ...
end
```

This seemingly small change is in my opinion significant as it standardizes the way Julia scripts are written. And such standardization is a good feature improving code readability and maintainability. Additionally, this feature helps in unification of interactive and compiled workflows of using Julia.

Let me show a minimal working example of writing a script using the `@main` macro:

```
$ julia -e "using InteractiveUtils; (@main)(args) = versioninfo()"
Julia Version 1.11.0-rc1
Commit 3a35aec36d (2024-06-25 10:23 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 12 × 12th Gen Intel(R) Core(TM) i7-1250U
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, alderlake)
Threads: 1 default, 0 interactive, 1 GC (on 12 virtual cores)
$
```

In this example we invoke the `versioninfo` function inside the `main(args)` function defined using the `@main` macro. Note that we did not have to explicitly call the `main` function in the code. It was invoked automatically because it was created using the `@main` macro.

Now I hope you know what the `@main` macro does and how to use it in Julia 1.11. Enjoy scripting with Julia!

We are now in the RC1 phase of Julia 1.11. One small but important addition is that it makes `IdSet` a public type. Today I want to discuss when this type is useful.

The code was tested under Julia 1.11 RC1.

There are three basic ways to test equality in Julia:

- the `==` operator;
- the `isequal` function;
- the `===` operator.

I have ordered these comparison operators by their level of strictness.

The `==` operator is the most *loose*. It can return `true`, `false`, or `missing`. The `missing` value is returned if any of the compared values is missing (or recursively contains a `missing` value). For floating-point numbers it assumes that `0.0` is equal to `-0.0` and that `NaN` is not equal to `NaN`. Let us see the last case in action, as it can be surprising if you have never seen it before:

```
julia> NaN == NaN
false
```
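The `missing` propagation rule can be checked the same way (a small extra illustration, not in the original text):

```julia
1 == missing            # missing
[1, 2] == [1, missing]  # missing: the elementwise comparison hits a missing value
0.0 == -0.0             # true
```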

Next is `isequal`, which is more strict. It is guaranteed to return `true` or `false`. It treats all floating-point `NaN` values as equal to each other, treats `-0.0` as unequal to `0.0`, and `missing` as equal to `missing`. It compares objects by their value, not by their identity. So, for example, two different vectors having the same contents are considered equal:

```
julia> v1 = [1, 2, 3]
3-element Vector{Int64}:
1
2
3
julia> v2 = [1, 2, 3]
3-element Vector{Int64}:
1
2
3
julia> isequal(v1, v2)
true
```

Finally we have `===`, which is the most strict. It returns `true` or `false`. However, `true` is returned if and only if the compared values are indistinguishable. They must have the same type. If their types are identical, mutable objects are compared by address in memory and immutable objects (such as numbers) are compared by contents at the bit level. Therefore the `v1` and `v2` vectors we created above are not equal when compared with `===`:

```
julia> v1 === v2
false
```

You might ask about the `NaN` values we talked about before. Here the situation is more complicated: they can be equal or not equal. Since `NaN` values are immutable, `===` compares them at the bit level. So we have:

```
julia> Float16(NaN) == Float32(NaN)
false
julia> isequal(Float16(NaN), Float32(NaN))
true
julia> Float16(NaN) === Float32(NaN)
false
julia> Float16(NaN) == Float16(NaN)
false
julia> isequal(Float16(NaN), Float16(NaN))
true
julia> Float16(NaN) === Float16(NaN)
true
```

Thus, you have to be careful. Each of the three comparison methods I discussed has its uses, and it is well worth learning them.
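A practical place where the distinction matters (an extra example I am adding, not from the original text) is dictionary keys: `Dict` hashes keys with `isequal`, while `IdDict` uses `===`:

```julia
# Dict: lookup by value equality, so a fresh vector with the
# same contents finds the entry
d1 = Dict([1, 2] => "a")
d1[[1, 2]]            # "a"

# IdDict: lookup by identity, so a fresh vector is a different key
d2 = IdDict([1, 2] => "a")
haskey(d2, [1, 2])    # false
```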

Standard sets in Julia, created using the `Set` constructor, use `isequal` to test for equality. Therefore we have:

```
julia> Set([v1, v2])
Set{Vector{Int64}} with 1 element:
[1, 2, 3]
```

We see that `v1` and `v2` got de-duplicated because they are equal with respect to `isequal`, since they have the same contents. This is often what the user wants.

However, sometimes we want to track actual objects (irrespective of their contents). This is especially important when working with mutable structures. In this case `IdSet` is useful:

```
julia> IdSet{Vector{Int}}([v1, v2])
IdSet{Vector{Int64}} with 2 elements:
[1, 2, 3]
[1, 2, 3]
```

Note that we needed to specify the type of the values stored in the `IdSet`. As an exception, `IdSet()` is allowed (not requiring you to pass the stored element type), in which case an empty `IdSet{Any}` is created.

Now you might ask when `IdSet` is most useful in practice. In my coding practice I needed it most often when working with nested mutable containers that could potentially contain circular references. In such cases `IdSet` allows you to easily keep track of the mutable objects already seen and avoid an infinite loop or stack overflow when you e.g. use recursion to work with such a deeply nested data structure.
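To make this concrete, here is a sketch of such a traversal (my own illustration, not from the original post; it assumes Julia 1.11, where `IdSet` is public — on earlier versions use `Base.IdSet`):

```julia
# Count integers in a nested vector structure that may contain
# circular references. The IdSet records containers already
# visited (by identity), so the recursion terminates.
function count_ints(x, seen=IdSet{Any}())
    x isa Int && return 1
    x isa AbstractVector || return 0
    x in seen && return 0  # already visited this exact object
    push!(seen, x)
    return sum(v -> count_ints(v, seen), x; init=0)
end

v = Any[1, 2]
push!(v, v)     # create a circular reference
count_ints(v)   # returns 2 instead of recursing forever
```

A `Set` would not work here: it would consider two distinct vectors with equal contents as already seen, while we care about object identity.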

Julia 1.11 is currently in its beta testing phase. One of the changes it introduces is a redesign of the internal representation of arrays. From the user perspective, this redesign promises to speed up certain operations. One of the common operations that I use often is `push!`, so today I decided to benchmark it.

The tests were performed under Julia 1.11.0-beta2 and Julia 1.10.1. The benchmarks use BenchmarkTools.jl 1.5.0.

Here is the function we are going to use for our tests:

```
using BenchmarkTools

function test(n)
    x = Int[]
    for i in 1:n
        push!(x, i)
    end
    return x
end
```

This is the most basic test of the performance of the `push!` operation. I want to check the performance for various numbers of `push!` operations.

Let us run the tests first under Julia 1.11.0-beta2:

```
julia> @benchmark test(100)
BenchmarkTools.Trial: 10000 samples with 849 evaluations.
Range (min … max): 129.800 ns … 1.682 μs ┊ GC (min … max): 0.00% … 85.46%
Time (median): 194.582 ns ┊ GC (median): 0.00%
Time (mean ± σ): 232.033 ns ± 125.847 ns ┊ GC (mean ± σ): 15.55% ± 19.08%
Memory estimate: 1.94 KiB, allocs estimate: 4.
julia> @benchmark test(10_000)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 14.400 μs … 6.167 ms ┊ GC (min … max): 0.00% … 97.67%
Time (median): 30.300 μs ┊ GC (median): 0.00%
Time (mean ± σ): 51.585 μs ± 148.402 μs ┊ GC (mean ± σ): 21.45% ± 10.84%
Memory estimate: 326.41 KiB, allocs estimate: 14.
julia> @benchmark test(1_000_000)
BenchmarkTools.Trial: 808 samples with 1 evaluation.
Range (min … max): 2.993 ms … 90.221 ms ┊ GC (min … max): 0.00% … 95.38%
Time (median): 4.674 ms ┊ GC (median): 19.53%
Time (mean ± σ): 6.176 ms ± 8.774 ms ┊ GC (mean ± σ): 35.69% ± 20.33%
Memory estimate: 17.41 MiB, allocs estimate: 24.
julia> @benchmark test(100_000_000)
BenchmarkTools.Trial: 6 samples with 1 evaluation.
Range (min … max): 808.177 ms … 1.020 s ┊ GC (min … max): 9.38% … 26.13%
Time (median): 959.266 ms ┊ GC (median): 25.01%
Time (mean ± σ): 944.380 ms ± 81.448 ms ┊ GC (mean ± σ): 22.25% ± 6.41%
Memory estimate: 2.95 GiB, allocs estimate: 42.
```

Now the same tests under Julia 1.10.1:

```
julia> @benchmark test(100)
BenchmarkTools.Trial: 10000 samples with 199 evaluations.
Range (min … max): 359.296 ns … 10.699 μs ┊ GC (min … max): 0.00% … 82.66%
Time (median): 923.116 ns ┊ GC (median): 0.00%
Time (mean ± σ): 959.401 ns ± 347.833 ns ┊ GC (mean ± σ): 2.21% ± 6.25%
Memory estimate: 1.92 KiB, allocs estimate: 4.
julia> @benchmark test(10_000)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 47.700 μs … 4.938 ms ┊ GC (min … max): 0.00% … 94.66%
Time (median): 103.200 μs ┊ GC (median): 0.00%
Time (mean ± σ): 133.997 μs ± 158.472 μs ┊ GC (mean ± σ): 7.43% ± 6.81%
Memory estimate: 326.55 KiB, allocs estimate: 9.
julia> @benchmark test(1_000_000)
BenchmarkTools.Trial: 504 samples with 1 evaluation.
Range (min … max): 6.729 ms … 88.534 ms ┊ GC (min … max): 0.00% … 91.83%
Time (median): 8.773 ms ┊ GC (median): 0.00%
Time (mean ± σ): 9.924 ms ± 5.161 ms ┊ GC (mean ± σ): 7.30% ± 9.24%
Memory estimate: 9.78 MiB, allocs estimate: 14.
julia> @benchmark test(100_000_000)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
Range (min … max): 1.184 s … 1.394 s ┊ GC (min … max): 8.36% … 6.56%
Time (median): 1.275 s ┊ GC (median): 7.46%
Time (mean ± σ): 1.282 s ± 86.217 ms ┊ GC (mean ± σ): 6.89% ± 5.29%
Memory estimate: 1019.60 MiB, allocs estimate: 23.
```

From the tests we can see that:

- The new implementation in Julia 1.11 is faster for the various tested values of `n`. This is very nice.
- The new implementation in Julia 1.11 does more allocations, has a higher memory estimate, and, in consequence, spends more time in garbage collection. This means that in cases when available RAM is scarce, code performance could be affected.

Today I decided to follow up on my last post solving a coin-tossing game. This time, instead of simulation, I want to use a numerical approach (which is probably a bit harder).

The post was written under Julia 1.10.1 and Graphs.jl 1.11.0.

Let me describe the setting of a game first (it is an extension of this post).

Assume Alice and Bob toss a fair coin. In each toss a head (`h`) or a tail (`t`) can show up with equal probability.

Alice and Bob choose some sequence of `h` and `t` they are waiting for. We assume that the chosen sequences have the same length and are different. For example, Alice could choose `htht` and Bob `tthh`.

The winner of the game is the person who saw their sequence first.

The question we ask is whether, for a fixed sequence length `n`, we can get cycles, that is, for example, that sequence `s1` beats `s2`, `s2` beats `s3`, and `s3` beats `s1`.

To answer this question we will represent the game as a Markov process.

First we create the transition matrix of a Markov chain tracking the current `n`-element sequence in the game we consider.
Here is the code:

```
function markov(size::Integer)
    idx2states = vec(join.(Iterators.product([['h', 't'] for _ in 1:size]...)))
    states2idx = Dict(idx2states .=> eachindex(idx2states))
    P = zeros(2^size, 2^size)
    for state in idx2states
        for next in ("h", "t")
            nextstate = chop(state, head=1, tail=0) * next
            P[states2idx[state], states2idx[nextstate]] = 0.5
        end
    end
    return P, idx2states, states2idx
end
```

What we do in it is as follows:

- The `idx2states` vector keeps track of all `h` and `t` sequences that have length `n` (i.e. it is a mapping from state number to state signature).
- `states2idx` is the inverse mapping, from state signature to state number.
- `P` is the transition matrix of our chain. Note that from the sequence `ab...` (where all elements are `h` or `t`) we go to sequence `b...h` or `b...t` with equal probability.
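The state update relies on `chop` dropping the oldest toss before the new one is appended. A quick check of this building block:

```julia
state = "htt"
chop(state, head=1, tail=0) * "h"  # drops the oldest toss 'h' and appends 'h', giving "tth"
```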

We now need to create a function that is aware of Alice’s and Bob’s chosen sequences and make them terminating. We want to compute the probabilities of ending up in Alice’s and Bob’s state. Here is the code:

```
function game(P, states2idx, alice, bob)
    P_game = copy(P)
    alice_idx, bob_idx = states2idx[alice], states2idx[bob]
    P_game[alice_idx, :] .= 0.0
    P_game[alice_idx, alice_idx] = 1.0
    P_game[bob_idx, :] .= 0.0
    P_game[bob_idx, bob_idx] = 1.0
    n = length(states2idx)
    terminal = fill(1 / n, 1, n) * P_game^(2^30)
    return terminal[states2idx[alice]], terminal[states2idx[bob]]
end
```
```

Note that we first update the `P_game` matrix to make the `alice_idx` and `bob_idx` states terminating. Then, since I was lazy, we simply make `2^30` steps of the process (fortunately in Julia this is fast). Observe that initially all states are equally probable, so the `terminal` matrix keeps information about the long-term probabilities of staying in all possible states. We extract the probabilities of Alice's and Bob's states and return them.

We are now ready for a final move. We can consider all possible preferred sequences of Alice and Bob and create a graph that keeps track of which sequences beat other sequences:

```
using Graphs

function analyze_game(size::Integer, details::Bool=true)
    P, idx2states, states2idx = markov(size)
    g = SimpleDiGraph(length(states2idx))
    details && println("\nWinners:")
    for alice in idx2states, bob in idx2states
        alice > bob || continue
        alice_win, bob_win = game(P, states2idx, alice, bob)
        if alice_win > 0.51
            winner = "alice"
            add_edge!(g, states2idx[alice], states2idx[bob])
        elseif bob_win > 0.51
            winner = "bob"
            add_edge!(g, states2idx[bob], states2idx[alice])
        else
            winner = "tie (or close :))"
        end
        details && println(alice, " vs ", bob, ": ", winner)
    end
    cycles = simplecycles(g)
    if !isempty(cycles)
        min_len = minimum(length, cycles)
        filter!(x -> length(x) == min_len, cycles)
    end
    println("\nCycles:")
    for cycle in cycles
        println(idx2states[cycle])
    end
end
```

Note that I used a `0.51` threshold for detecting dominance of one state over another. We could do better, but in practice for small `n` it is enough, and working this way is numerically simpler. What this threshold means is that we want to be "sure" that one player beats the other.
In our code we do two things:

- optionally print the information which state beats which state;
- print information about cycles found in beating patterns (we keep only cycles of the shortest length).

Let us check the code. Start with sequences of length 2:

```
julia> analyze_game(2)
Winners:
th vs hh: alice
th vs ht: tie (or close :))
ht vs hh: tie (or close :))
tt vs hh: tie (or close :))
tt vs th: tie (or close :))
tt vs ht: bob
Cycles:
```

We see that only `th` beats `hh` and `ht` beats `tt` (this is a symmetric case). We did not find any cycles.

Let us check 3:

```
julia> analyze_game(3)
Winners:
thh vs hhh: alice
thh vs hth: tie (or close :))
thh vs hht: alice
thh vs htt: tie (or close :))
hth vs hhh: alice
hth vs hht: bob
tth vs hhh: alice
tth vs thh: alice
tth vs hth: alice
tth vs hht: tie (or close :))
tth vs tht: alice
tth vs htt: bob
hht vs hhh: tie (or close :))
tht vs hhh: alice
tht vs thh: tie (or close :))
tht vs hth: tie (or close :))
tht vs hht: bob
tht vs htt: tie (or close :))
htt vs hhh: alice
htt vs hth: tie (or close :))
htt vs hht: bob
ttt vs hhh: tie (or close :))
ttt vs thh: bob
ttt vs hth: bob
ttt vs tth: tie (or close :))
ttt vs hht: bob
ttt vs tht: bob
ttt vs htt: bob
Cycles:
["thh", "hht", "htt", "tth"]
```

We now have a cycle. The shortest cycle has length 4 and it is unique. Let us see what happens for patterns of length 4 (I suppress printing the details as there are too many of them):

```
julia> analyze_game(4, false)
Cycles:
["thhh", "hhth", "hthh"]
["thhh", "hhtt", "ttth"]
["hhth", "hthh", "thht"]
["hhth", "thtt", "tthh"]
["hthh", "hhtt", "thth"]
["hthh", "hhtt", "ttht"]
["hthh", "hhtt", "ttth"]
["thht", "hhtt", "ttth"]
["htht", "thtt", "tthh"]
["thtt", "tthh", "hhht"]
["thtt", "htth", "ttht"]
["thtt", "httt", "ttht"]
["tthh", "hhht", "htth"]
["tthh", "hhht", "httt"]
```

In this case we have many cycles that are even shorter as they have length three.

The conclusion is that the game is slightly surprising. We can have cycles of dominance between sequences. I hope you liked this example. Happy summer!

Two weeks ago I wrote a post about a simple coin-tossing game. Today let me follow up on it with a bit more difficult question and a slightly changed implementation strategy.

The post was written under Julia 1.10.1, DataFrames.jl 1.6.1, and StatsBase.jl 0.34.3.

Let me describe the setting of a game first (it is similar to what I described in this post).

Assume Alice and Bob toss a fair coin `n` times. In each toss a head (`h`) or a tail (`t`) can show up with equal probability.

Alice counts the number of times an `ht` sequence showed. Bob counts the number of times an `hh` sequence showed.

The winner of the game is the person who saw a bigger number of occurrences of their favorite sequence.
So, for example, take `n=3`. If we get `hhh` then Bob wins (he sees 2 occurrences of `hh`, while Alice sees 0 occurrences of `ht`). If we get `hht` there is a tie (both patterns occurred once). If we get `tht` Alice wins.

The questions are:

- Who, on the average sees more occurrences of their favorite pattern?
- Who is more likely to win this game?

Let us try to answer these questions using Julia as usual.

We start by writing a simulator of a single game:

```
using Random

function play(n::Integer)
    seq = randstring("ht", n)
    return (hh=count("hh", seq, overlap=true),
            ht=count("ht", seq, overlap=true))
end
```

The function is not optimized for speed (we could even avoid storing the whole sequence), but I think it nicely shows how powerful library functions in Julia are. The `randstring` function allows us to generate random strings, in this case consisting of a random sequence of `h` and `t`. Next, the `count` function allows us to count the number of occurrences of the desired patterns. Note that we use the `overlap=true` keyword argument to count all occurrences of the pattern (by default only disjoint occurrences are counted).
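The effect of `overlap=true` is easy to see on a short example:

```julia
seq = "hhhh"
count("hh", seq)                # 2: only disjoint occurrences are counted
count("hh", seq, overlap=true)  # 3: matches starting at positions 1, 2, and 3
```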

Let us check the output of a single run of the game:

```
julia> play(10)
(hh = 3, ht = 3)
```

In my case (I did not seed the random number generator) we see that for `n=10` we got a sequence that had `3` occurrences of both `hh` and `ht`, so it is a tie.

Here is a simulator that, for a given `n`, runs the game `reps` times and aggregates the results:

```
using DataFrames
using Statistics
using StatsBase

function sim_play(n::Integer, reps::Integer)
    df = DataFrame([play(n) for _ in 1:reps])
    df.winner = cmp.(df.hh, df.ht)
    agg = combine(df,
                  ["hh", "ht"] .=> [mean std skewness],
                  "winner" .=>
                      [x -> mean(==(i), x) for i in -1:1] .=>
                      ["ht_win", "tie", "hh_win"])
    return insertcols!(agg, 1, "n" => n)
end
```

What we do in the code is as follows. First we run the game `reps` times and transform the result into a `DataFrame`. Next we add a column denoting the winner of the game. In the `"winner"` column 1 means that `hh` won, 0 means a tie, and -1 means that `ht` won. Finally we compute the following aggregates (using the transformation minilanguage; if you do not have much experience with it you can have a look at this post):
Finally we compute the following aggregates (using transformation minilanguage; if you do not have much experience with it you can have a look at this post):

- mean, standard deviation, and skewness of `hh` and `ht` counts;
- probability that `ht` wins, that there is a tie, and that `hh` wins.

Here is the result of running the code for `reps=1_000_000` and `n` varying from 2 to 16:

```
julia> Random.seed!(1234);
julia> reduce(vcat, [sim_play(n, 1_000_000) for n in 2:16])
15×10 DataFrame
Row │ n hh_mean ht_mean hh_std ht_std hh_skewness ht_skewness ht_win tie hh_win
│ Int64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 2 0.25068 0.249825 0.433405 0.432912 1.15052 1.15578 0.249825 0.499495 0.25068
2 │ 3 0.499893 0.499595 0.706871 0.5 1.06068 0.00162 0.374385 0.375765 0.24985
3 │ 4 0.751224 0.748855 0.902063 0.559496 1.0232 0.00312512 0.373833 0.37559 0.250577
4 │ 5 1.00168 1.00012 1.06192 0.612535 0.940274 -6.5033e-5 0.406445 0.28037 0.313185
5 │ 6 1.25098 1.2493 1.19926 0.661162 0.869559 -0.0012833 0.437276 0.233841 0.328883
6 │ 7 1.49972 1.50011 1.32213 0.707523 0.812272 -0.00190003 0.437774 0.234531 0.327695
7 │ 8 1.75064 1.74802 1.43616 0.750169 0.76024 0.00319491 0.440714 0.211252 0.348034
8 │ 9 1.99906 2.00108 1.53902 0.789413 0.715722 0.000107041 0.451749 0.189353 0.358898
9 │ 10 2.24857 2.25009 1.63787 0.829086 0.676735 -0.00207707 0.45343 0.184585 0.361985
10 │ 11 2.50092 2.50007 1.73343 0.867326 0.646397 0.000650687 0.454418 0.175059 0.370523
11 │ 12 2.74753 2.75065 1.81994 0.901478 0.621238 -0.00118389 0.458332 0.164575 0.377093
12 │ 13 2.99635 3.00128 1.90199 0.935108 0.597227 0.00212776 0.460248 0.159239 0.380513
13 │ 14 3.2469 3.25101 1.9814 0.96887 0.575535 -0.000255108 0.460817 0.154523 0.38466
14 │ 15 3.50074 3.49934 2.05981 0.998945 0.55527 0.000827465 0.461547 0.147699 0.390754
15 │ 16 3.75258 3.7513 2.13521 1.03027 0.538056 0.000772964 0.463627 0.142931 0.393442
```

What do we learn from these results?

On average, `hh` and `ht` occur the same number of times. We see this from the `"hh_mean"` and `"ht_mean"` columns. This is expected: since in a given pair of consecutive tosses `hh` and `ht` have the same probability of occurrence (0.25), the result follows from the linearity of expected value. We can see that as we increase `n` the values in these columns increase roughly by `0.25`.
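As a quick sanity check (a small addition of mine, not from the original post), the linearity argument gives an explicit formula: each of the `n - 1` adjacent pairs matches a fixed 2-toss pattern with probability 0.25, so:

```julia
# expected number of occurrences of a fixed 2-toss pattern in n tosses
expected_count(n) = (n - 1) * 0.25

expected_count(10)  # 2.25, matching the simulated hh_mean and ht_mean for n = 10
```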

However, the probability of `ht` winning is higher than the probability of `hh` winning (except for `n=2`, when they are equal). We can see this from the `"ht_win"` and `"hh_win"` columns. This is surprising, as the patterns occur, on average, the same number of times.

To understand the phenomenon we can look at the `"hh_std"`, `"ht_std"`, `"hh_skewness"`, and `"ht_skewness"` columns. We can clearly see that the `hh` pattern count has a higher standard deviation and for `n>2` it is positively skewed (while `ht` has zero skewness). This means that `hh` counts are more spread out (i.e. they can be high, but also low). Additionally, for `hh` we have a few quite high values balanced by more low values, relative to `ht` (as the means for both patterns are the same). This, in turn, means that when `hh` wins over `ht` it wins by a larger margin, but this happens less often than seeing `ht` win over `hh`.

The core reason for this behavior was discussed in my previous post: `hh` occurrences can cluster (as e.g. in the `hhh` pattern), while `ht` patterns cannot overlap.

I hope you found this puzzle interesting. If you are interested how the properties we described can be proven analytically I recommend you check out this paper.

Next week I organize the WAW2024 conference. The event covers various aspects of theoretical and applied modeling of networks.

As an introduction I want to run a simulation of an example problem. Consider a random graph with probability `p` of an edge between any two nodes. Next, assume that we repeatedly pick an edge uniformly at random from this graph and remove the two nodes forming this edge from the graph as matched. The question is: what is the expected fraction of nodes that get matched by this process?

Today, I will investigate this problem using simulation.

The post was written under Julia 1.10.1, Graphs.jl 1.11.0, and DataFrames.jl 1.6.1.

Here is a simulator of our greedy matching process. In the simulation we traverse all edges of the graph in random order. In the `matched` vector we keep track of which nodes have already been matched.

```
using Graphs
using Random
using Statistics

function run_sim(n::Integer, p::Real)
    g = erdos_renyi(n, p)
    matched = fill(false, n)
    for e in shuffle!(collect(edges(g)))
        n1, n2 = e.src, e.dst
        if !(matched[n1] || matched[n2])
            matched[n1] = true
            matched[n2] = true
        end
    end
    return mean(matched)
end
```

Let us now test our simulator for a graph on `10000` nodes and `p` varying from `0.00001` to `0.1` (on a logarithmic scale).

```
julia> using DataFrames
julia> df = DataFrame(p=Float64[], rep=Int[], res=Float64[])
0×3 DataFrame
Row │ p rep res
│ Float64 Int64 Float64
─────┴─────────────────────────
julia> ps = [10.0^i for i in -5:-1]
5-element Vector{Float64}:
1.0e-5
0.0001
0.001
0.010000000000000002
0.1
julia> Random.seed!(1234);
julia> @time for p in ps, rep in 1:16
push!(df, (p, rep, run_sim(10_000, p)))
end
79.190585 seconds (438.02 M allocations: 14.196 GiB, 7.13% gc time, 0.36% compilation time)
julia> df
80×3 DataFrame
Row │ p rep res
│ Float64 Int64 Float64
─────┼─────────────────────────
1 │ 1.0e-5 1 0.0948
2 │ 1.0e-5 2 0.094
3 │ 1.0e-5 3 0.097
4 │ 1.0e-5 4 0.0892
5 │ 1.0e-5 5 0.0848
6 │ 1.0e-5 6 0.093
⋮ │ ⋮ ⋮ ⋮
76 │ 0.1 12 0.9992
77 │ 0.1 13 0.999
78 │ 0.1 14 0.999
79 │ 0.1 15 0.999
80 │ 0.1 16 0.9988
69 rows omitted
```

The simulation took a bit over a minute, mainly due to the `p=0.1` case, which generates a lot of edges in the graph. Let us aggregate the obtained data to get the mean, standard error, and range of the results over all values of `p`:

```
julia> combine(groupby(df, "p"),
"p" => (x -> 10_000 * first(x)) => "mean_degree",
"res" => mean,
"res" => (x -> std(x) / sqrt(length(x))) => "res_se",
"res" => extrema)
5×5 DataFrame
Row │ p mean_degree res_mean res_se res_extrema
│ Float64 Float64 Float64 Float64 Tuple…
─────┼───────────────────────────────────────────────────────────────
1 │ 1.0e-5 0.1 0.090975 0.000842986 (0.0848, 0.097)
2 │ 0.0001 1.0 0.499888 0.00190971 (0.4848, 0.5134)
3 │ 0.001 10.0 0.909425 0.000523729 (0.9062, 0.9126)
4 │ 0.01 100.0 0.990162 0.000257694 (0.9888, 0.992)
5 │ 0.1 1000.0 0.999 5.47723e-5 (0.9986, 0.9994)
```

We can see that the sharp increase of fraction of matched nodes happens around mean degree of 1 in the graph.
Additionally, we see that even for high `p` we do not match every node in the greedy matching process.
Finally the obtained results are relatively well concentrated around the mean.

If you want to see how this problem can be solved analytically I recommend you to read this paper. Using the formulas derived there we can compare our simulation results with the asymptotic theory:

```
julia> (10_000 .* ps) ./ (10_000 .* ps .+ 1)
5-element Vector{Float64}:
0.09090909090909091
0.5
0.9090909090909091
0.9900990099009901
0.999000999000999
```

Indeed we see that the match is quite good.

If such problems are interesting for you I invite you to join us during WAW2024 conference.

I have been writing my blog for over 4 years now (without missing a single week). My first post was on May 10, 2020; you can find it here.

There is a small change in how I distribute my content. Starting from last week I made the repository of my blog public, so if you find any mistake please do not hesitate to open a Pull Request here.

To celebrate this I decided to go back to my favorite topic – mathematical puzzles. Today I use a classic coin-tossing game example.

The post was written under Julia 1.10.1, StatsBase.jl 0.34.4, FreqTables.jl 0.4.6, and BenchmarkTools.jl 1.5.0.

Assume Alice and Bob toss a fair coin. Alice wins if, after a head (`H`) is tossed, a tail (`T`) is tossed, that is, we see an `HT` sequence. Bob wins if two consecutive heads are tossed, that is, we see an `HH` sequence.

The questions are:

- Who is more likely to win this game?
- If only Alice played, how long, on average, would she wait till `HT` was tossed?
- If only Bob played, how long, on average, would he wait till `HH` was tossed?

Let us try to answer these questions using Julia.

This code simulates the situation when Alice and Bob play together:

```
function both()
    a = rand(('H', 'T'))
    while true
        b = rand(('H', 'T'))
        if a == 'T'
            a = b
        else
            return b == 'H' ? "Bob" : "Alice"
        end
    end
end
```

The `both` function returns `"Bob"` if Bob wins and `"Alice"` otherwise. From the code it should already be clear that both players have the same probability of winning. The only way to terminate the simulation is `return b == 'H' ? "Bob" : "Alice"`, and this condition is symmetric with respect to Alice and Bob. Let us confirm this by running a simulation:

```
julia> using FreqTables, Random
julia> Random.seed!(1234);
julia> freqtable([both() for _ in 1:100_000_000])
2-element Named Vector{Int64}
Dim1 │
──────┼─────────
A │ 50000012
B │ 49999988
```

Indeed, the numbers of times Alice and Bob win seem to be the same.

Now let us check how long, on average, Alice has to wait to see the `HT` sequence. Here is Alice's simulator:

```
function alice()
    a = rand(('H', 'T'))
    i = 1
    while true
        b = rand(('H', 'T'))
        i += 1
        a == 'H' && b == 'T' && return i
        a = b
    end
end
```

Let us check it:

```
julia> using StatsBase
julia> describe([alice() for _ in 1:100_000_000])
Summary Stats:
Length: 100000000
Missing Count: 0
Mean: 3.999890
Std. Deviation: 2.000032
Minimum: 2.000000
1st Quartile: 2.000000
Median: 3.000000
3rd Quartile: 5.000000
Maximum: 31.000000
Type: Int64
```

So it seems that, in expectation, Alice finishes her game in 4 tosses. Can we expect the same for Bob (as we remember, if they play together they have the same chances of finishing first)? Let us see.

Now let us check how long, on average, Bob has to wait to see the `HH` sequence. Here is Bob's simulator:

```
function bob()
    a = rand(('H', 'T'))
    i = 1
    while true
        b = rand(('H', 'T'))
        i += 1
        a == 'H' && b == 'H' && return i
        a = b
    end
end
```

Let us check it:

```
julia> describe([bob() for _ in 1:100_000_000])
Summary Stats:
Length: 100000000
Missing Count: 0
Mean: 5.999915
Std. Deviation: 4.690177
Minimum: 2.000000
1st Quartile: 2.000000
Median: 5.000000
3rd Quartile: 8.000000
Maximum: 87.000000
Type: Int64
```

To our surprise, Bob needs 6 coin tosses, on average, to see `HH`.

What is the reason for this difference? Assume we have just tossed `H`. Start with Bob: if we next toss `H` we finish, but if we toss `T` we need to wait until we see `H` again before we can even consider finishing. For Alice, however, if we toss `T` we finish, and if we toss `H` we do not have to wait for anything; we are already in a state that gives us a chance to finish the game in the next step.

The difference between joint games and separate games is a bit surprising and I hope you found it interesting if you have not seen this puzzle before. Today I have approached this problem using simulation. However, it is easy to write down a Markov chain representation of all three scenarios and solve them analytically. I encourage you to try doing this exercise.
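If you want a head start on that exercise, here is one possible sketch (my addition, not from the original post) of the waiting-time part. For each player there are two transient states, "no useful history" and "last toss was `H`"; writing the expected-waiting-time equations and solving the resulting 2×2 linear systems reproduces the simulated means:

```julia
# Waiting time for HT (states: e0 = no useful history, e1 = last toss was H):
#   e0 = 1 + 0.5*e1 + 0.5*e0   (toss H -> e1, toss T -> stay in e0)
#   e1 = 1 + 0.5*e1 + 0.5*0    (toss H -> stay in e1, toss T -> done)
A_ht = [0.5 -0.5; 0.0 0.5]

# Waiting time for HH:
#   e0 = 1 + 0.5*e1 + 0.5*e0   (toss H -> e1, toss T -> stay in e0)
#   e1 = 1 + 0.5*0  + 0.5*e0   (toss H -> done, toss T -> back to e0)
A_hh = [0.5 -0.5; -0.5 1.0]

b = [1.0, 1.0]
A_ht \ b  # [4.0, 2.0]: starting from scratch, Alice waits 4 tosses in expectation
A_hh \ b  # [6.0, 4.0]: starting from scratch, Bob waits 6 tosses in expectation
```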

PS:

In the code I use the `rand(('H', 'T'))` form to generate randomness. It is much faster than e.g. writing `rand(["H", "T"])` (which would be a first instinct), for two reasons:

- using `Char` instead of `String` is a more lightweight option;
- using a `Tuple` instead of a `Vector` avoids allocations.

Let us see a comparison of timing (I cut out the histograms from the output):

```
julia> using BenchmarkTools
julia> @benchmark rand(('H', 'T'))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 1.900 ns … 233.700 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.500 ns ┊ GC (median): 0.00%
Time (mean ± σ): 3.316 ns ± 2.772 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark rand(["H", "T"])
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
Range (min … max): 15.816 ns … 2.086 μs ┊ GC (min … max): 0.00% … 96.30%
Time (median): 19.019 ns ┊ GC (median): 0.00%
Time (mean ± σ): 23.041 ns ± 60.335 ns ┊ GC (mean ± σ): 9.44% ± 3.63%
Memory estimate: 64 bytes, allocs estimate: 1.
```

In this case we could also just use `rand(Bool)` (as the coin is fair and has only two states):

```
julia> @benchmark rand(Bool)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 1.700 ns … 131.000 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.200 ns ┊ GC (median): 0.00%
Time (mean ± σ): 3.218 ns ± 2.112 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
```

but as you can see `rand(('H', 'T'))` has a similar speed and leads to much more readable code.

Today I want to return to the topic of the performance implications of using anonymous functions in combination with DataFrames.jl. I have written about it in the past, but it is an issue that new users often ask about. In this post I will explain the problem, its causes, and how to avoid it.

The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.

Consider the following sequence of DataFrames.jl operations (I am suppressing the output of the operations with `;` as it is irrelevant):

```
julia> using DataFrames
julia> df = DataFrame(x=1:3);
julia> @time select(df, :x => (x -> 2 * x) => :x2);
0.077134 seconds (73.33 k allocations: 5.010 MiB, 99.28% compilation time)
julia> @time select(df, :x => (x -> 2 * x) => :x2);
0.013731 seconds (6.30 k allocations: 450.148 KiB, 98.15% compilation time)
julia> @time subset(df, :x => ByRow(x -> x > 1.5));
0.094046 seconds (91.86 k allocations: 6.219 MiB, 99.29% compilation time)
julia> @time subset(df, :x => ByRow(x -> x > 1.5));
0.086597 seconds (43.05 k allocations: 2.881 MiB, 42.62% gc time, 99.44% compilation time)
```

In both the `select` and `subset` examples I used an anonymous function. In the first case it was `x -> 2 * x`, and in the second `x -> x > 1.5`.

What you can notice is that most of the time (even in consecutive calls) is spent in compilation. What is the reason for this?

Let me explain this by example. When we pass the `x -> 2 * x` function to `select`, the `select` function needs to be compiled. Since the `select` function is quite complex, its compilation time is long.

Why does this happen? The reason is that each time we write `x -> 2 * x`, Julia defines a new anonymous function. The Julia compiler does not recognize that it is in fact the same function. Have a look here:

```
julia> x -> 2 * x
#9 (generic function with 1 method)
julia> x -> 2 * x
#11 (generic function with 1 method)
```

We can see that we get a different function (one denoted `#9` and the other `#11`) although the definition of the function is identical.
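We can confirm this with a quick check of my own: each definition produces a value of a distinct type, and it is this type difference that forces `select` to specialize and compile again:

```julia
f = x -> 2 * x
g = x -> 2 * x

@assert f(3) == g(3) == 6       # identical behavior...
@assert typeof(f) != typeof(g)  # ...but distinct types, hence recompilation
```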

Fortunately, there is a simple way to resolve this problem. Instead of using an anonymous function, just use a named function:

```
julia> times2(x) = 2 * x
times2 (generic function with 1 method)
julia> @time select(df, :x => times2 => :x2);
0.013728 seconds (5.63 k allocations: 401.305 KiB, 98.54% compilation time)
julia> @time select(df, :x => times2 => :x2);
0.000142 seconds (71 allocations: 3.516 KiB)
julia> gt15(x) = x > 1.5
gt15 (generic function with 1 method)
julia> @time subset(df, :x => ByRow(gt15));
0.041173 seconds (42.64 k allocations: 2.849 MiB, 99.01% compilation time)
julia> @time subset(df, :x => ByRow(gt15));
0.000165 seconds (120 allocations: 5.648 KiB)
```

Now you see that consecutive calls are fast and do not cause compilation.

Actually, instead of defining the `gt15` function we could just have written `>(1.5)`:

```
julia> >(1.5)
(::Base.Fix2{typeof(>), Float64}) (generic function with 1 method)
```

This defines a functor that works like a named function (so it requires only one compilation):

```
julia> @time subset(df, :x => ByRow(>(1.5)));
0.075423 seconds (41.80 k allocations: 2.804 MiB, 99.32% compilation time)
julia> @time subset(df, :x => ByRow(>(1.5)));
0.000189 seconds (124 allocations: 5.898 KiB)
```

If you want to learn how functors work in Julia, have a look here.
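To give a flavor of what such a functor looks like, here is a minimal hand-rolled analogue of the `Base.Fix2` object created by `>(1.5)` (the struct name `GreaterThan` is my own invention, purely for illustration):

```julia
struct GreaterThan
    threshold::Float64
end

# making instances callable is what turns the struct into a functor
(g::GreaterThan)(x) = x > g.threshold

gt = GreaterThan(1.5)
@assert gt(2.0)                   # behaves like >(1.5)
@assert !gt(1.0)
@assert gt(2.0) == (>(1.5))(2.0)  # agrees with Base.Fix2
```

Since `GreaterThan(1.5)` and `GreaterThan(2.5)` share a single type, code compiled for one instance is reused for the other, which is exactly why `>(1.5)` needs only one compilation of `subset`.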

Today I have presented some simple examples, but I hope that they are useful for new users of DataFrames.jl in helping them to improve the performance of their code.

This week I got a nice little surprise in my office. A year after my Julia for Data Analysis book was published I got a package with a set of printed copies of its Korean translation 데이터 분석을 위한 줄리아. It was a really nice experience and I hope that Julia users from Korea will like it.

Therefore, for today, I decided to discuss a functionality that is little known, but often quite useful. It is related to adding conditional columns to a data frame.

The post was written under Julia 1.10.1, DataFrames.jl 1.6.1, and DataFramesMeta.jl 0.15.2.

Assume you have the following data frame:

```
julia> using DataFrames
julia> df = DataFrame(x=-2.0:0.5:2.0)
9×1 DataFrame
Row │ x
│ Float64
─────┼─────────
1 │ -2.0
2 │ -1.5
3 │ -1.0
4 │ -0.5
5 │ 0.0
6 │ 0.5
7 │ 1.0
8 │ 1.5
9 │ 2.0
```

Now we want to add a second column to this data frame that contains the square root of column `"x"`.

A basic approach fails:

```
julia> df.sqrtx = sqrt.(df.x)
ERROR: DomainError with -2.0:
sqrt was called with a negative real argument but will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
```

The reason is that we cannot normally take the square root of a negative number.
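As the error message itself hints, `sqrt` does return a complex result when given a complex argument (just a quick aside of mine; we will not use complex numbers in the rest of this post):

```julia
z = sqrt(Complex(-2.0))
@assert z ≈ sqrt(2) * im  # the principal square root of -2 is i * sqrt(2)
```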

We can perform conditional processing, for example like this:

```
julia> df.sqrtx = (x -> x < 0.0 ? missing : sqrt(x)).(df.x)
9-element Vector{Union{Missing, Float64}}:
missing
missing
missing
missing
0.0
0.7071067811865476
1.0
1.224744871391589
1.4142135623730951
julia> df
9×2 DataFrame
Row │ x sqrtx
│ Float64 Float64?
─────┼─────────────────────────
1 │ -2.0 missing
2 │ -1.5 missing
3 │ -1.0 missing
4 │ -0.5 missing
5 │ 0.0 0.0
6 │ 0.5 0.707107
7 │ 1.0 1.0
8 │ 1.5 1.22474
9 │ 2.0 1.41421
```

but I do not find this approach very readable (especially from the perspective of a beginner).

The alternative that I prefer is to work with a view of the source data frame. Let us first create such a view that contains all columns of the original data frame, but only the rows in which column `"x"` is non-negative:

```
julia> dfv = filter(:x => >=(0.0), df, view=true)
5×2 SubDataFrame
Row │ x sqrtx
│ Float64 Float64?
─────┼───────────────────
1 │ 0.0 0.0
2 │ 0.5 0.707107
3 │ 1.0 1.0
4 │ 1.5 1.22474
5 │ 2.0 1.41421
```

Now, we can add a column to such a view by using the plain `sqrt` function without any decorations:

```
julia> dfv.sqrtx2 = sqrt.(dfv.x)
5-element Vector{Float64}:
0.0
0.7071067811865476
1.0
1.224744871391589
1.4142135623730951
julia> dfv
5×3 SubDataFrame
Row │ x sqrtx sqrtx2
│ Float64 Float64? Float64?
─────┼─────────────────────────────
1 │ 0.0 0.0 0.0
2 │ 0.5 0.707107 0.707107
3 │ 1.0 1.0 1.0
4 │ 1.5 1.22474 1.22474
5 │ 2.0 1.41421 1.41421
julia> df
9×3 DataFrame
Row │ x sqrtx sqrtx2
│ Float64 Float64? Float64?
─────┼─────────────────────────────────────────
1 │ -2.0 missing missing
2 │ -1.5 missing missing
3 │ -1.0 missing missing
4 │ -0.5 missing missing
5 │ 0.0 0.0 0.0
6 │ 0.5 0.707107 0.707107
7 │ 1.0 1.0 1.0
8 │ 1.5 1.22474 1.22474
9 │ 2.0 1.41421 1.41421
```

Note that both `dfv` and `df` are updated as expected. The filtered-out rows get `missing` values.

It is important to highlight that this functionality works only if the view (`SubDataFrame`) was created using all columns of the source data frame (as is done in our `filter` call above).
The reason for this restriction is that if the view contained only a subset of columns, the operation of adding a column would be unsafe (there would be a risk of an accidental and unwanted overwrite of a column that is present in the source data frame but not included in the view).
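To see the restriction in action, here is a small sketch of mine on toy data (column names are made up): the assignment works on a view keeping all columns, but fails when the view selects only a subset of them:

```julia
using DataFrames

df2 = DataFrame(a=[1.0, -1.0], b=[10, 20])
dfv_all = view(df2, df2.a .> 0, :)    # view keeping all columns
dfv_all.c = [99.0]                    # fine: column added, other rows get missing

dfv_sub = view(df2, df2.a .> 0, [:a]) # view keeping only column :a
threw = try
    dfv_sub.d = [99.0]                # disallowed: could shadow a hidden column
    false
catch
    true
end
@assert threw
```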

This functionality is especially nice in combination with DataFramesMeta.jl, just have a look:

```
julia> @chain df begin
           @rsubset(:x >= 0; view=true)
           @rtransform!(:sqrtx3 = sqrt(:x))
           parent
       end
9×4 DataFrame
Row │ x sqrtx sqrtx2 sqrtx3
│ Float64 Float64? Float64? Float64?
─────┼─────────────────────────────────────────────────────────
1 │ -2.0 missing missing missing
2 │ -1.5 missing missing missing
3 │ -1.0 missing missing missing
4 │ -0.5 missing missing missing
5 │ 0.0 0.0 0.0 0.0
6 │ 0.5 0.707107 0.707107 0.707107
7 │ 1.0 1.0 1.0 1.0
8 │ 1.5 1.22474 1.22474 1.22474
9 │ 2.0 1.41421 1.41421 1.41421
```

In the code above I used `parent` in the last step to recover the source `df`.

As a final comment note that an alternative in DataFramesMeta.jl is to just use a plain `@rtransform!` macro:

```
julia> @rtransform!(df, :sqrtx4 = :x < 0 ? missing : sqrt(:x))
9×5 DataFrame
Row │ x sqrtx sqrtx2 sqrtx3 sqrtx4
│ Float64 Float64? Float64? Float64? Float64?
─────┼─────────────────────────────────────────────────────────────────────────
1 │ -2.0 missing missing missing missing
2 │ -1.5 missing missing missing missing
3 │ -1.0 missing missing missing missing
4 │ -0.5 missing missing missing missing
5 │ 0.0 0.0 0.0 0.0 0.0
6 │ 0.5 0.707107 0.707107 0.707107 0.707107
7 │ 1.0 1.0 1.0 1.0 1.0
8 │ 1.5 1.22474 1.22474 1.22474 1.22474
9 │ 2.0 1.41421 1.41421 1.41421 1.41421
```

In this case it is also quite clean.

I am really happy that we have a Korean version of Julia for Data Analysis.

I hope that the example transformations I have shown today were useful and improved your knowledge of DataFrames.jl and DataFramesMeta.jl packages.

This week it is a holiday period in Poland so I decided to solve a puzzle. I liked the code as it can be used to show some basic features of the Julia language.

The examples were written under Julia 1.10.1, HTTP.jl 1.10.6, and Graphs.jl 1.10.0.

I decided to use my favorite Project Euler puzzle set. This time I chose Problem 79.

Here is its statement (taken from the Project Euler website):

A common security method used for online banking is to ask the user for three random characters from a passcode. For example, if the passcode was 531278, they may ask for the 2nd, 3rd, and 5th characters; the expected reply would be: 317. The text file, keylog.txt, contains fifty successful login attempts. Given that the three characters are always asked for in order, analyse the file so as to determine the shortest possible secret passcode of unknown length.

The keylog.txt file can be found under this link: https://projecteuler.net/resources/documents/0079_keylog.txt.

Let us try solving the puzzle.

First we use the HTTP.jl package to get the data and pre-process it. Start by storing the file as a string:

```
julia> using HTTP
julia> url = "https://projecteuler.net/resources/documents/0079_keylog.txt"
"https://projecteuler.net/resources/documents/0079_keylog.txt"
julia> str = String(HTTP.get(url).body)
"319\n680\n180\n690\n129\n620\n762\n689\n762\n318\n368\n710\n720\n710\n629\n168\n160\n689\n716\n731\n736\n729\n316\n729\n729\n710\n769\n290\n719\n680\n318\n389\n162\n289\n162\n718\n729\n319\n790\n680\n890\n362\n319\n760\n316\n729\n380\n319\n728\n716\n"
```

Now we want to process this string into a vector of vectors containing the digits verified by the user.
First we split the string by newlines using the `split` function. Next we process each line by transforming it into a vector of numbers. We use two features of Julia here. The first is the `collect` function, which, when passed a string, returns a vector of characters. The second is broadcasting: by broadcasted subtraction of `'0'` from a vector of characters we get a vector of integers. Here is the code:

```
julia> v = [collect(x) .- '0' for x in split(str)]
50-element Vector{Vector{Int64}}:
[3, 1, 9]
[6, 8, 0]
[1, 8, 0]
⋮
[3, 1, 9]
[7, 2, 8]
[7, 1, 6]
```

Now we are ready to analyze the data. We will use a directed graph to represent it.
The directed graph will have 10 nodes, each representing a digit. Because Julia uses 1-based indexing, the node number of digit `x` will be `x+1`.
Here is the code creating the directed graph:

```
julia> using Graphs
julia> gr = DiGraph(10, 0)
{10, 0} directed simple Int64 graph
julia> for x in v
           add_edge!(gr, x[1] + 1, x[2] + 1)
           add_edge!(gr, x[2] + 1, x[3] + 1)
       end
julia> gr
{10, 23} directed simple Int64 graph
```

Note that we have 23 relationships constraining the sequence of the digits in the unknown passcode. Let us check, for each digit, how many distinct digits it precedes (out-degree) and how many it follows (in-degree) in our graph:

```
julia> [outdegree(gr) indegree(gr)]
10×2 Matrix{Int64}:
0 5
5 2
3 3
3 1
0 0
0 0
4 3
5 0
2 4
1 5
```

From this summary we see that the first node (representing digit `0`) is never a source, so it can be the last digit in the passcode. Similarly, the eighth node (representing `7`) is never a destination, so it can be the first digit. Finally, digits `4` and `5` are neither a source nor a destination, so they can be dropped.

How can we programmatically find the list of nodes that can be dropped? We can simply find all nodes whose total degree is `0`:

```
julia> to_drop = findall(==(0), degree(gr)) .- 1
2-element Vector{Int64}:
4
5
```

Now we are ready for a final move. Let us assume that our directed graph does not have cycles (this is a simple case, as then we can assume that each number is present exactly once in the code). In this case we can use the topological sorting to find the shortest sequence of numbers consistent with the observed data. In our case to get the topological sorting of nodes in the graph we can write:

```
julia> ts = topological_sort(gr)
10-element Vector{Int64}:
8
6
5
4
2
7
3
9
10
1
```

We did not get an error, which means that our directed graph did not have any cycles, so we are done.
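For contrast, here is a sketch of mine showing what would have happened with a cycle: `topological_sort` raises an error on a cyclic digraph, so a silent success really does certify that our graph is acyclic:

```julia
using Graphs

cyc = DiGraph(2)
add_edge!(cyc, 1, 2)
add_edge!(cyc, 2, 1)   # 1 -> 2 -> 1 forms a cycle
threw = try
    topological_sort(cyc)
    false
catch
    true
end
@assert threw          # sorting a cyclic digraph fails
```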

What is left to get a solution is to correct the node numbering (as we start numbering with `1` and the smallest digit is `0`) and remove the digits that are never used. As usual, I leave the final solution un-evaluated, to encourage you to run the code yourself:

```
setdiff(ts .- 1, to_drop)
```

I hope you enjoyed the puzzle and the solution!
