Re-posted from: https://bkamins.github.io/julialang/2020/11/14/sorting.html
Introduction
A classical problem in computing is finding what is the number comparisons that
a comparison sort algorithm minimally requires in worst-case given an
\(n\)-element list. Let us try to tackle this problem, as usual using
DataFrames.jl to support the presentation of the results.
In what follows I use Julia 1.5.3, DataFrames.jl 0.21.8 and Pipe.jl 1.3.0.
Theoretical lower bound
As there are \(n!\) permutations of an \(n\)-element set and doing \(k\)
comparisons allow us to distinguish at most \(2^k\) states so we get that
\(k\geq\lceil\log_2(n!)\rceil\).
Let us use DataFrames.jl and to quickly calculate this bound on the number of
queries for \(n\in[7]\):
julia> using DataFrames
julia> using Pipe
julia> @pipe DataFrame(n=1:7) |>
transform(_, :n => ByRow(factorial) => :permutations) |>
transform(_, :permutations =>
ByRow(x -> ceil(Int, log2(x))) =>
:queries)
7×3 DataFrame
│ Row │ n │ permutations │ queries │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼──────────────┼─────────┤
│ 1 │ 1 │ 1 │ 0 │
│ 2 │ 2 │ 2 │ 1 │
│ 3 │ 3 │ 6 │ 3 │
│ 4 │ 4 │ 24 │ 5 │
│ 5 │ 5 │ 120 │ 7 │
│ 6 │ 6 │ 720 │ 10 │
│ 7 │ 7 │ 5040 │ 13 │
Now, we switch to a core of the exercise — let us write a program that sorts
up to seven elements in no more than the above calculated number of queries.
The judge and the player
Assume we want to have a Judge
that knows the order of objects and can be
queried to compare the position in the order of the queried objects:
julia> mutable struct Judge
state::Vector{Int}
queries::Int
function Judge(state)
@assert isperm(state)
new(state, 0)
end
end
julia> function compare(judge::Judge, i::Int, j::Int)
judge.queries += 1
judge.state[i] < judge.state[j]
end
compare (generic function with 1 method)
As you can see the Judge
will count the number of queries that it received.
Also we restrict the Judge
to accept only permutations for simplicity
(in general it could accept any comparable objects though).
Now we create a player
function that will be allowed to query the Judge
and it is expected to return the ordering of objects that the Judge
has.
julia> using Combinatorics
julia> function player(judge::Judge, options::Vector{Vector{Int}})
@assert !isempty(options)
n = length(options[1])
length(options) == 1 && return options[1]
besti, bestj, best_split = 0, 0, typemax(Int)
starti = length(options) == factorial(n) ? n - 1 : 1
for i in starti:n, j in i+1:n
yes_count = count(opt -> opt[i] < opt[j], options)
current_split = max(yes_count, length(options) - yes_count)
if current_split < best_split
best_split = current_split
besti, bestj = i, j
end
end
query = compare(judge, besti, bestj)
filter!(opt -> (opt[besti] < opt[bestj]) == query, options)
return player(judge, options)
end
player (generic function with 2 methods)
julia> player(judge::Judge) =
player(judge, collect(permutations(1:length(judge.state))))
player (generic function with 2 methods)
Initially player allows all permutations. With each question only the
permutations that are consistent with the received answers are retained. Note
that the player
function uses a very simple rule: it asks the question after
which the number of remaining options in the worst case is minimized. As a
special case the starti = length(options) == factorial(n) ? n - 1 : 1
line
uses the fact that if we ask the first question it does not really matter which
pair of elements we compare (but starting from the second question we consider
all possible comparisons).
Let us test our player
function on a few examples:
julia> j1 = Judge([1, 3, 5, 6, 4, 2])
Judge([1, 3, 5, 6, 4, 2], 0)
julia> player(j1)
6-element Array{Int64,1}:
1
3
5
6
4
2
julia> j1.queries
10
julia> j2 = Judge([7, 6, 5, 4, 3, 1, 2])
Judge([7, 6, 5, 4, 3, 1, 2], 0)
julia> player(j2)
7-element Array{Int64,1}:
7
6
5
4
3
1
2
julia> j2.queries
12
The function seems to be working correctly. Now let us test it systematically
for \(n\in[7]\):
julia> df = DataFrame()
0×0 DataFrame
julia> for n in 1:7
queries = map(permutations(1:n)) do perm
j = Judge(perm)
if player(j) != perm
error("player gave wrong result for $perm input")
end
return j.queries
end
append!(df, DataFrame(n=n, queries=queries))
end
julia> @pipe groupby(df, [:n, :queries], sort=true) |>
combine(_, nrow) |>
unstack(_, :queries, :n, :nrow) |>
coalesce.(_, "") |>
show(_, eltypes=false, summary=false)
│ Row │ queries │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │
├─────┼─────────┼───┼───┼───┼────┼─────┼─────┼──────┤
│ 1 │ 0 │ 1 │ │ │ │ │ │ │
│ 2 │ 1 │ │ 2 │ │ │ │ │ │
│ 3 │ 2 │ │ │ 2 │ │ │ │ │
│ 4 │ 3 │ │ │ 4 │ │ │ │ │
│ 5 │ 4 │ │ │ │ 8 │ │ │ │
│ 6 │ 5 │ │ │ │ 16 │ │ │ │
│ 7 │ 6 │ │ │ │ │ 8 │ │ │
│ 8 │ 7 │ │ │ │ │ 112 │ │ │
│ 9 │ 9 │ │ │ │ │ │ 304 │ │
│ 10 │ 10 │ │ │ │ │ │ 416 │ │
│ 11 │ 11 │ │ │ │ │ │ │ 80 │
│ 12 │ 12 │ │ │ │ │ │ │ 2912 │
│ 13 │ 13 │ │ │ │ │ │ │ 2048 │
As you can see we never got an error, so our player
seems to work correctly.
Also it works optimally in worst-case. The bound we have calculated above was
never exceeded.
As an additional exercise I have used several common functions from
DataFrames.jl to (hopefully) nicely format the resulting table (in DataFrames.jl
0.22 release it will be yet nicer as we are switching to PrettyTables.jl as a
back-end for text/plain printing).
Take away notes
Unfortunately for \(n=8\) the simple algorithm we used does not produce the
desired results any more (warning — this calculation is a bit more lengthly):
julia> queries = map(permutations(1:8)) do perm
j = Judge(perm)
if player(j) != perm
error("player gave wrong result for $perm input")
end
return j.queries
end;
julia> @pipe DataFrame(queries=queries) |>
groupby(_, :queries, sort=true) |>
combine(_, nrow)
4×2 DataFrame
│ Row │ queries │ nrow │
│ │ Int64 │ Int64 │
├─────┼─────────┼───────┤
│ 1 │ 14 │ 1072 │
│ 2 │ 15 │ 22104 │
│ 3 │ 16 │ 16936 │
│ 4 │ 17 │ 208 │
and ceil(Int, log2(factorial(8)))
is equal to 16
. It is known that this
bound is achievable, so we fail in 208 cases. Therefore, as a challenge, I leave
you to think of an algorithm that is better. A natural approach is to find
optimal splits using backtracking while aggressively pruning branches that
cannot lead to an optimal solution to reduce the complexity, but maybe some
simple heuristic on top of the one I considered will be enough?