# A classical sorting exercise

# Introduction

A classical problem in computing is finding what is the number comparisons that a comparison sort algorithm minimally requires in worst-case given an \(n\)-element list. Let us try to tackle this problem, as usual using DataFrames.jl to support the presentation of the results.

In what follows I use Julia 1.5.3, DataFrames.jl 0.21.8 and Pipe.jl 1.3.0.

# Theoretical lower bound

As there are \(n!\) permutations of an \(n\)-element set and doing \(k\) comparisons allow us to distinguish at most \(2^k\) states so we get that \(k\geq\lceil\log_2(n!)\rceil\).

Let us use DataFrames.jl and to quickly calculate this bound on the number of queries for \(n\in[7]\):

```
julia> using DataFrames
julia> using Pipe
julia> @pipe DataFrame(n=1:7) |>
transform(_, :n => ByRow(factorial) => :permutations) |>
transform(_, :permutations =>
ByRow(x -> ceil(Int, log2(x))) =>
:queries)
7×3 DataFrame
│ Row │ n │ permutations │ queries │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼──────────────┼─────────┤
│ 1 │ 1 │ 1 │ 0 │
│ 2 │ 2 │ 2 │ 1 │
│ 3 │ 3 │ 6 │ 3 │
│ 4 │ 4 │ 24 │ 5 │
│ 5 │ 5 │ 120 │ 7 │
│ 6 │ 6 │ 720 │ 10 │
│ 7 │ 7 │ 5040 │ 13 │
```

Now, we switch to a core of the exercise — let us write a program that sorts up to seven elements in no more than the above calculated number of queries.

# The judge and the player

Assume we want to have a `Judge`

that knows the order of objects and can be
queried to compare the position in the order of the queried objects:

```
julia> mutable struct Judge
state::Vector{Int}
queries::Int
function Judge(state)
@assert isperm(state)
new(state, 0)
end
end
julia> function compare(judge::Judge, i::Int, j::Int)
judge.queries += 1
judge.state[i] < judge.state[j]
end
compare (generic function with 1 method)
```

As you can see the `Judge`

will count the number of queries that it received.
Also we restrict the `Judge`

to accept only permutations for simplicity
(in general it could accept any comparable objects though).

Now we create a `player`

function that will be allowed to query the `Judge`

and it is expected to return the ordering of objects that the `Judge`

has.

```
julia> using Combinatorics
julia> function player(judge::Judge, options::Vector{Vector{Int}})
@assert !isempty(options)
n = length(options[1])
length(options) == 1 && return options[1]
besti, bestj, best_split = 0, 0, typemax(Int)
starti = length(options) == factorial(n) ? n - 1 : 1
for i in starti:n, j in i+1:n
yes_count = count(opt -> opt[i] < opt[j], options)
current_split = max(yes_count, length(options) - yes_count)
if current_split < best_split
best_split = current_split
besti, bestj = i, j
end
end
query = compare(judge, besti, bestj)
filter!(opt -> (opt[besti] < opt[bestj]) == query, options)
return player(judge, options)
end
player (generic function with 2 methods)
julia> player(judge::Judge) =
player(judge, collect(permutations(1:length(judge.state))))
player (generic function with 2 methods)
```

Initially player allows all permutations. With each question only the
permutations that are consistent with the received answers are retained. Note
that the `player`

function uses a very simple rule: it asks the question after
which the number of remaining options in the worst case is minimized. As a
special case the `starti = length(options) == factorial(n) ? n - 1 : 1`

line
uses the fact that if we ask the first question it does not really matter which
pair of elements we compare (but starting from the second question we consider
all possible comparisons).

Let us test our `player`

function on a few examples:

```
julia> j1 = Judge([1, 3, 5, 6, 4, 2])
Judge([1, 3, 5, 6, 4, 2], 0)
julia> player(j1)
6-element Array{Int64,1}:
1
3
5
6
4
2
julia> j1.queries
10
julia> j2 = Judge([7, 6, 5, 4, 3, 1, 2])
Judge([7, 6, 5, 4, 3, 1, 2], 0)
julia> player(j2)
7-element Array{Int64,1}:
7
6
5
4
3
1
2
julia> j2.queries
12
```

The function seems to be working correctly. Now let us test it systematically for \(n\in[7]\):

```
julia> df = DataFrame()
0×0 DataFrame
julia> for n in 1:7
queries = map(permutations(1:n)) do perm
j = Judge(perm)
if player(j) != perm
error("player gave wrong result for $perm input")
end
return j.queries
end
append!(df, DataFrame(n=n, queries=queries))
end
julia> @pipe groupby(df, [:n, :queries], sort=true) |>
combine(_, nrow) |>
unstack(_, :queries, :n, :nrow) |>
coalesce.(_, "") |>
show(_, eltypes=false, summary=false)
│ Row │ queries │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │
├─────┼─────────┼───┼───┼───┼────┼─────┼─────┼──────┤
│ 1 │ 0 │ 1 │ │ │ │ │ │ │
│ 2 │ 1 │ │ 2 │ │ │ │ │ │
│ 3 │ 2 │ │ │ 2 │ │ │ │ │
│ 4 │ 3 │ │ │ 4 │ │ │ │ │
│ 5 │ 4 │ │ │ │ 8 │ │ │ │
│ 6 │ 5 │ │ │ │ 16 │ │ │ │
│ 7 │ 6 │ │ │ │ │ 8 │ │ │
│ 8 │ 7 │ │ │ │ │ 112 │ │ │
│ 9 │ 9 │ │ │ │ │ │ 304 │ │
│ 10 │ 10 │ │ │ │ │ │ 416 │ │
│ 11 │ 11 │ │ │ │ │ │ │ 80 │
│ 12 │ 12 │ │ │ │ │ │ │ 2912 │
│ 13 │ 13 │ │ │ │ │ │ │ 2048 │
```

As you can see we never got an error, so our `player`

seems to work correctly.
Also it works optimally in worst-case. The bound we have calculated above was
never exceeded.

As an additional exercise I have used several common functions from DataFrames.jl to (hopefully) nicely format the resulting table (in DataFrames.jl 0.22 release it will be yet nicer as we are switching to PrettyTables.jl as a back-end for text/plain printing).

# Take away notes

Unfortunately for \(n=8\) the simple algorithm we used does not produce the desired results any more (warning — this calculation is a bit more lengthly):

```
julia> queries = map(permutations(1:8)) do perm
j = Judge(perm)
if player(j) != perm
error("player gave wrong result for $perm input")
end
return j.queries
end;
julia> @pipe DataFrame(queries=queries) |>
groupby(_, :queries, sort=true) |>
combine(_, nrow)
4×2 DataFrame
│ Row │ queries │ nrow │
│ │ Int64 │ Int64 │
├─────┼─────────┼───────┤
│ 1 │ 14 │ 1072 │
│ 2 │ 15 │ 22104 │
│ 3 │ 16 │ 16936 │
│ 4 │ 17 │ 208 │
```

and `ceil(Int, log2(factorial(8)))`

is equal to `16`

. It is known that this
bound is achievable, so we fail in 208 cases. Therefore, as a challenge, I leave
you to think of an algorithm that is better. A natural approach is to find
optimal splits using backtracking while aggressively pruning branches that
cannot lead to an optimal solution to reduce the complexity, but maybe some
simple heuristic on top of the one I considered will be enough?