
Adjusting to Julia: Tea tasting

By: Jan Vanhove

Re-posted from: https://janhove.github.io/posts/2023-02-23-julia-tea-tasting/index.html

In this third blog post in which I try my hand at the Julia language, I’ll tackle a slight variation of an old problem – Fisher’s tea tasting lady – both analytically and using a brute-force simulation.

The lady tasting tea

In The Design of Experiments, Ronald A. Fisher explained the Fisher exact test using the following example. Imagine that a lady claims she can taste the difference between cups of tea in which the tea was poured into the cup first and then milk was added, and cups of tea in which the milk was poured first and then the tea was added. A sceptic might put the lady to the test and prepare eight cups of tea – four with tea to which milk was added, and four with milk to which tea was added. (Yuck to both, by the way.) The lady is presented with these in a random order and is asked to identify those four cups with tea to which milk was added. Now, if the lady has no discriminatory ability whatever, there is only a 1-in-70 chance she identifies all four cups correctly. This is because there are 70 ways of picking four cups out of eight, and only one of these ways is correct. In Julia:

binomial(8, 4)
70

We can now imagine a slight variation on this experiment. If the lady identifies all four cups correctly, we choose to believe she has the purported discriminatory ability. If she identifies two or fewer cups correctly, we remain sceptical. But if she identifies three out of four cups correctly, we prepare another eight cups of tea and give her another chance under the same conditions.

We can ask two questions about this new procedure:

  1. With what probability will we believe the lady if she, in fact, does not have any discriminatory ability?
  2. How many rounds of tea tasting will we need on average before the experiment terminates?

In the following, I’ll share both analytical and simulation-based answers to these questions.

Analytical solution

Under the null hypothesis of no discriminatory ability, the number of correctly identified cups in any one draw, \(X\), follows a hypergeometric distribution with parameters \(N = 8\) (total), \(K = 4\) (successes) and \(n = 4\) (draws), i.e., \[X \sim \textrm{Hypergeometric}(8, 4, 4).\] In any given round, the subject fails the test if she only identifies 0, 1 or 2 cups correctly. Under the null hypothesis, the probability of this happening is given by \(p = P(X \leq 2) = 53/70\), the value of which we can determine using the cumulative mass function of the Hypergeometric(8, 4, 4) distribution. We suspend judgement on the subject’s discriminatory abilities if she identifies exactly three cups correctly, in which case she has another go. Under the null hypothesis, the probability of this happening in any given round is given by \(q = P(X = 3) = \binom{4}{3}\binom{4}{1}/\binom{8}{4} = 16/70\), the value of which can be determined using the probability mass function of the Hypergeometric(8, 4, 4) distribution.

With those probabilities in hand, we can now compute the probability that the subject fails the experiment in a specific round, assuming the null hypothesis is correct. In the first round, she will fail the experiment with probability \(p\). In order to fail in the second round, she needed to have advanced to the second round, which happens with probability \(q\), and then fail in that round, which happens with probability \(p\). That is, she will fail in the second round with probability \(qp\). To fail in the third round, she needed to advance to the third round, which happens with probability \(q^2\), and then fail in the third round, which happens with probability \(p\). That is, she will fail in the third round with probability \(q^2p\). Etc. etc. The probability that she will fail somewhere in the experiment if the null hypothesis is true, then, is given by \[\sum_{i = 1}^{\infty} q^{i - 1}p = p\sum_{i = 0}^{\infty} q^{i} = \frac{p}{1 - q},\] where the first equality is just a matter of shifting the index and the second equality holds because the expression is a geometric series (recall that \(0 < q < 1\)).
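As a quick numerical sanity check on this algebra, we can plug in the values \(p = 53/70\) and \(q = 16/70\) derived above and verify in Julia that a truncated version of the series gives the same answer as the closed-form expression:

p, q = 53/70, 16/70;
sum(q^(i - 1) * p for i in 1:100) ≈ p / (1 - q)  # truncate the series after 100 terms
true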

Let’s compute the final results using Julia. The following loads the Distributions package and then defines d as the Hypergeometric(8, 4, 4) distribution. Note that in Julia, the parameters for the Hypergeometric distribution aren’t \(N\) (total), \(K\) (successes) and \(n\) (draws), but rather \(s\) (successes), \(f\) (failures) and \(n\) (draws); see the documentation. We then read off the values for \(p\) and \(q\) from the cumulative mass function and probability mass function, respectively:

using Distributions
d = Hypergeometric(4, 4, 4);
p = cdf(d, 2);
q = pdf(d, 3);

The probability that the subject will fail the experiment if she does indeed not have the purported discriminatory ability is now easily computed:

p / (1 - q)
0.9814814814814815

The next question is how many rounds we expect the experiment to carry on for if the null hypothesis is true. In each round, the probability that the experiment terminates in that round is given by \(1 - q\). From the geometric distribution, we know that we then on average need \(1/(1 - q)\) attempts before the first terminating event occurs:

1 / (1 - q)
1.2962962962962965

In sum, if the subject lacks any discriminatory ability, there is only a 1.85% chance that she will pass the test, and on average, the experiment will run for 1.3 rounds.

Brute-force solution

First, we define a function experiment() that runs the experiment once. In essence, we have an urn that contains four correct identifications (true) and four incorrect identifications (false). From this urn, we sample() four identifications without replacement.

Note, incidentally, that Julia functions can take both positional arguments and keyword arguments. In the sample() command below, both urn and 4 are passed as positional arguments, and you’d have to read the documentation to figure out which argument specifies what. The keyword arguments are separated from the positional arguments by a semi-colon and are identified with the parameter’s name.
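To make the distinction concrete, here is a toy example of my own (the function brew() is made up for this post and isn’t part of any package). The arguments cups and rounds are positional; milk_first is a keyword argument with a default value:

function brew(cups, rounds; milk_first = true)
    return "Brewing $cups cups over $rounds round(s); milk first: $milk_first"
end;

brew(8, 1; milk_first = false)
"Brewing 8 cups over 1 round(s); milk first: false"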

Next, we count the number of trues in our pick using sum(). Depending on how many trues there are in pick, we either terminate the experiment – returning false if we remain sceptical and true if we choose to believe the lady – or run the experiment for one more round. The number of attempts is tallied and output as well.

function experiment(attempt = 1)
    urn = [false, false, false, false, true, true, true, true]
    pick = sample(urn, 4; replace = false)
    number = sum(pick)
    if number <= 2
        return false, attempt
    elseif number >= 4
        return true, attempt
    else
        return experiment(attempt + 1)
    end
end;

A single run of experiment() could produce the following output:

experiment()
(false, 1)

Next, we write a function simulate() that runs experiment() a large number of times and outputs both whether each run of experiment() led to us believing the lady or remaining sceptical and how many rounds each run took. These results are tabulated in a DataFrame – just because. Of note, Julia supports the list comprehensions that Python users will be familiar with. I use this feature here both to run the experiment a large number of times and to parse the output.

using DataFrames

function simulate(runs = 10000)
    results = [experiment() for _ in 1:runs]
    success = [results[i][1] for i in 1:runs]
    attempts = [results[i][2] for i in 1:runs]
    d = DataFrame(Successful = success, Attempts = attempts)
    return d
end;

Let’s swing for the fences and run this experiment a million times. Like in Python, we can make large numbers easier to parse by inserting underscores in them:

runs = 1_000_000;

Using the @time macro, we can check how long it takes for this simulation to finish.

@time d = simulate(runs)
  0.359740 seconds (4.07 M allocations: 361.334 MiB, 14.82% gc time, 35.07% compilation time)
1000000×2 DataFrame
     Row │ Successful  Attempts
         │ Bool        Int64
─────────┼──────────────────────
       1 │      false         1
       2 │      false         1
       3 │      false         1
       4 │      false         1
       5 │      false         1
       6 │      false         2
       7 │      false         3
       8 │      false         2
       9 │      false         1
      10 │      false         1
      11 │      false         1
      12 │      false         1
      13 │      false         2
       ⋮ │     ⋮           ⋮
  999989 │      false         1
  999990 │      false         1
  999991 │      false         1
  999992 │      false         4
  999993 │      false         1
  999994 │      false         1
  999995 │      false         1
  999996 │      false         1
  999997 │      false         1
  999998 │      false         1
  999999 │      false         1
 1000000 │      false         1
              999975 rows omitted

On my machine, then, running this simulation takes less than a second. Note that a sizeable share of this time – 35% in the output above – is compilation time. (Update: When migrating my blog to Quarto, I reran this code using a new Julia version (1.9.1). Now the code runs faster.) Indeed, if we run the function another time, i.e., after it’s been compiled, the run time drops to about 0.3 seconds (Update: 0.2 seconds now.):

@time d2 = simulate(runs)
  0.209087 seconds (3.89 M allocations: 348.982 MiB, 16.29% gc time)
1000000×2 DataFrame
     Row │ Successful  Attempts
         │ Bool        Int64
─────────┼──────────────────────
       1 │      false         1
       2 │      false         1
       3 │      false         1
       4 │      false         1
       5 │      false         1
       6 │      false         2
       7 │      false         2
       8 │      false         1
       9 │      false         1
      10 │      false         1
      11 │      false         2
      12 │      false         1
      13 │      false         1
       ⋮ │     ⋮           ⋮
  999989 │      false         3
  999990 │      false         3
  999991 │      false         1
  999992 │      false         1
  999993 │      false         1
  999994 │      false         1
  999995 │      false         1
  999996 │      false         1
  999997 │      false         1
  999998 │      false         2
  999999 │      false         1
 1000000 │      false         1
              999975 rows omitted

Using describe(), we see that this simulation – which doesn’t ‘know’ anything about hypergeometric and geometric distributions – produces the same answers that we arrived at by analytical means: there’s a 1.85% chance that we end up believing the lady even if she has no discriminatory ability whatsoever. And if she doesn’t have any discriminatory ability, we’ll need 1.3 rounds on average before terminating the experiment:

describe(d)
2×7 DataFrame
 Row │ variable    mean      min      median  max      nmissing  eltype
     │ Symbol      Float64   Integer  Float64 Integer  Int64     DataType
─────┼────────────────────────────────────────────────────────────────────
   1 │ Successful  0.018533  false       0.0  true            0  Bool
   2 │ Attempts    1.29611   1           1.0  10              0  Int64

The slight discrepancies between the simulation-based results and the analytical ones are just due to randomness. Below is a quick way of constructing 95% confidence intervals around both of our simulation-based estimates; the analytical solutions fall within both intervals.

means = mean.(eachcol(d))
2-element Vector{Float64}:
 0.018533
 1.296105
ses = std.(eachcol(d)) / sqrt(runs)
2-element Vector{Float64}:
 0.00013486862533794173
 0.0006198767726106645
upr = means + 1.96*ses
2-element Vector{Float64}:
 0.018797342505662368
 1.297319958474317
lwr = means - 1.96*ses
2-element Vector{Float64}:
 0.018268657494337634
 1.2948900415256832

Adjusting to Julia: The Levenshtein algorithm

By: Jan Vanhove

Re-posted from: https://janhove.github.io/posts/2023-02-09-julia-levenshtein/index.html

In this second blog post about Julia, I’ll share with you a Julia implementation of the Levenshtein algorithm.

The Levenshtein algorithm

The basic Levenshtein algorithm is used to count the minimum number of insertions, deletions and substitutions that are needed to convert one string into another. For instance, to convert English doubt into French doute, you need at least two operations. You could replace the b by a t and then replace the final t by an e; or you could delete the b and then insert an e at the end. As this example shows, there may be more than one way to convert one string into another using the minimum number of required operations, but this minimum number itself is unique for each pair of strings.

Implementation in Julia

I won’t cover the logic of the Levenshtein algorithm here. The following is a straightforward Julia implementation of the pseudocode found on Wikipedia, assuming a cost of 1 for all operations. The function takes two inputs (a string a that is to be converted to a string b) and outputs an array with the Levenshtein distances between all prefixes of a on the one hand and all prefixes of b on the other hand. The entry in the bottom right corner of this array is the Levenshtein distance between the full strings, and this is output separately as well.

function levenshtein(a::String, b::String)
    # Initialise table. (Strings are indexed byte by byte below,
    # so this implementation assumes single-byte, i.e. ASCII, characters.)
    distances = zeros(Int, length(a) + 1, length(b) + 1)
    distances[:, 1] = 0:length(a)
    distances[1, :] = 0:length(b)

    # Levenshtein logic
    for row in 2:(length(a) + 1)
        for col in 2:(length(b) + 1)
            distances[row, col] = min(
                distances[row - 1, col - 1] + (a[row - 1] != b[col - 1] ? 1 : 0)  # substitution, free if characters match
                , distances[row, col - 1] + 1
                , distances[row - 1, col] + 1
            )
        end
    end

    return distances, distances[length(a) + 1, length(b) + 1]
end
levenshtein (generic function with 1 method)

Let’s compute the Levenshtein distance between the German word Zyklus (‘cycle’) and its Swedish counterpart cykel. Note the use of ; at the end of the line to suppress the output.

dist_matrix, lev_cost = levenshtein("zyklus", "cykel");
display(dist_matrix)
7×6 Matrix{Int64}:
 0  1  2  3  4  5
 1  1  2  3  4  5
 2  2  1  2  3  4
 3  3  2  1  2  3
 4  4  3  2  2  2
 5  5  4  3  3  3
 6  6  5  4  4  4

This checks out: you do indeed need four operations to transform Zyklus into cykel.
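
The distance that levenshtein() returned separately confirms this as well:

lev_cost
4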

A vectorised function

But what if we wanted to apply our new function to several pairs of strings? Let’s first define three Dutch-German word pairs:

dutch = ("boek", "zuster", "sneeuw");
german = ("buch", "schwester", "schnee");

We can run our levenshtein() function on these three word pairs without writing a for-loop by simply appending a dot to the function name:

levenshtein.(dutch, german)
(([0 1 … 3 4; 1 0 … 2 3; … ; 3 2 … 2 3; 4 3 … 3 3], 3), ([0 1 … 8 9; 1 1 … 8 9; … ; 5 4 … 5 6; 6 5 … 6 5], 5), ([0 1 … 5 6; 1 0 … 4 5; … ; 5 4 … 4 3; 6 5 … 5 4], 4))

However, since the levenshtein() function outputs two pieces of information (both the matrix with the distances between the substrings as well as the final Levenshtein distance), this vectorised call yields a tuple of three subtuples, each subtuple containing both a matrix and the corresponding final Levenshtein distance. This is why the output above looks so messy. If we wanted to obtain just the Levenshtein distances, we could write a for-loop to extract them. But I think an easier solution is to first write a wrapper around the levenshtein() function that outputs only the final Levenshtein distance and use the vectorised version of this wrapper instead:

function lev_dist(a::String, b::String)
  return levenshtein(a, b)[2]
end
lev_dist (generic function with 1 method)

Now use the vectorised version of lev_dist():

lev_dist.(dutch, german)
(3, 5, 4)

Nice!

Obtaining the operations

We now know that we need four operations to transform Zyklus into cykel and five to transform zuster into Schwester. But which operations do you need for these transformations? The function lev_alignment() defined below outputs one possible set of operations that would do the job. (Unlike the minimum number of operations required to transform one string into another, the set of operations needed isn’t uniquely defined.)

function lev_alignment(a::String, b::String)
    source = Vector{Char}()
    target = Vector{Char}()
    operations = Vector{Char}()
    
    lev_matrix = levenshtein(a, b)[1]
    
    row = size(lev_matrix, 1)
    col = size(lev_matrix, 2)

    while (row > 1 && col > 1)
        if lev_matrix[row - 1, col - 1] == lev_matrix[row, col] &&
            lev_matrix[row - 1, col - 1] <= min(
              lev_matrix[row - 1, col]
              , lev_matrix[row, col - 1]
              )
            row = row - 1
            col = col - 1
            pushfirst!(source, a[row])
            pushfirst!(target, b[col])
            pushfirst!(operations, ' ')
        else 
            if lev_matrix[row - 1, col] <= min(lev_matrix[row - 1, col - 1], lev_matrix[row, col - 1])
                row = row - 1
                pushfirst!(source, a[row])
                pushfirst!(target, ' ')
                pushfirst!(operations, 'D')
            elseif lev_matrix[row, col - 1] <= lev_matrix[row - 1, col - 1]
                col = col - 1
                pushfirst!(source, ' ')
                pushfirst!(target, b[col])
                pushfirst!(operations, 'I')
            else
                row = row - 1
                col = col - 1
                pushfirst!(source, a[row])
                pushfirst!(target, b[col])
                pushfirst!(operations, 'S')
            end
        end
    end

    # If first column reached, move up.
    while (row > 1)
        row = row - 1
        pushfirst!(source, a[row])
        pushfirst!(target, ' ')
        pushfirst!(operations, 'D')
    end

    # If first row reached, move left.
    while (col > 1)
        col = col - 1
        pushfirst!(source, ' ')
        pushfirst!(target, b[col])
        pushfirst!(operations, 'I')
    end
    
    return vcat(
        reshape(source, (1, :))
        , reshape(target, (1, :))
        , reshape(operations, (1, :))
    )
end
lev_alignment (generic function with 1 method)

I won’t cover the logic behind the algorithm, as this post is more about learning Julia than about the Levenshtein algorithm. On the Julia side, note first how empty character vectors can be initialised. Moreover, notice that the pushfirst!() function is decorated with a ! (a ‘bang’). This convention communicates to whoever is reading the code that the function changes some of its inputs. For instance, pushfirst!(source, a[row]) means that the current character of a (i.e., a[row]) is added to the front of the source vector. That is, this command changes the source vector. Finally, the source, target and operations vectors are all column vectors. In order to display them somewhat nicely, I converted each of them to a single-row matrix using reshape(). Then, the three resulting rows are concatenated vertically using vcat() to show how the two strings can be aligned and which operations are needed to transform one into the other.
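
As a small toy demonstration of these building blocks (not part of the alignment function itself):

v = Vector{Char}();       # initialise an empty character vector
pushfirst!(v, 'b');       # the bang signals that v itself gets modified
pushfirst!(v, 'a');       # v is now ['a', 'b']
reshape(v, (1, :))        # reshape the column vector into a single-row matrix
1×2 Matrix{Char}:
 'a'  'b'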

Let’s see how we can transform Zyklus into cykel:

lev_alignment("zyklus", "cykel")
3×7 Matrix{Char}:
 'z'  'y'  'k'  ' '  'l'  'u'  's'
 'c'  'y'  'k'  'e'  'l'  ' '  ' '
 'S'  ' '  ' '  'I'  ' '  'D'  'D'

So we substitute c for z, insert an e and delete the u and s. As I mentioned, this set of operations isn’t uniquely defined. Indeed, we could have also substituted c for z, e for l and l for u and then deleted the s. This also corresponds to a Levenshtein distance of four operations.

Normalised Levenshtein distances

Above, we computed raw Levenshtein distances. The problem with these is that longer string pairs will tend to have larger raw Levenshtein distances than shorter string pairs, even if they intuitively seem more similar. To correct for this, we can compute normalised Levenshtein distances instead. There are various ways to do so; one option is to divide the raw Levenshtein distance by the length of the alignment:

function norm_lev_dist(a::String, b::String)
  raw_dist = lev_dist(a, b)
  alignment_length = size(lev_alignment(a, b), 2)
  return raw_dist / alignment_length
end
norm_lev_dist (generic function with 1 method)

(Behind the scenes, we run the Levenshtein algorithm twice: once in lev_dist() and again in lev_alignment(). This seems wasteful – unless the Julia compiler is able to clean up the double work? I’m not sure.)
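
Here is one possible way around the duplicated work – a sketch of my own, not code from the original post. Since matches are marked with a space in the operations row and each 'S', 'I' or 'D' entry corresponds to exactly one operation, a single call to lev_alignment() gives us both the raw distance and the alignment length:

function norm_lev_dist2(a::String, b::String)
    alignment = lev_alignment(a, b)  # runs the Levenshtein algorithm once
    raw_dist = count(c -> c != ' ', alignment[3, :])  # number of 'S', 'I' and 'D' entries
    return raw_dist / size(alignment, 2)
end;

norm_lev_dist2("zyklus", "cykel")
0.5714285714285714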

We obtain a normalised Levenshtein distance of about 0.57 (four operations divided by an alignment length of seven) for Zyklus and cykel:

norm_lev_dist("zyklus", "cykel")
0.5714285714285714

We can use a vectorised version of this function, too:

norm_lev_dist.(dutch, german)
(0.75, 0.5555555555555556, 0.5)

Of course, normalised Levenshtein distances are symmetric, so we obtain the same result when running the following command:

norm_lev_dist.(german, dutch)
(0.75, 0.5555555555555556, 0.5)