By: Jan Vanhove
Re-posted from: https://janhove.github.io/posts/2023-02-23-julia-tea-tasting/index.html
In this third blog post in which I try my hand at the Julia language, I’ll tackle a slight variation of an old problem – Fisher’s tea tasting lady – both analytically and using a brute-force simulation.
The lady tasting tea
In The Design of Experiments, Ronald A. Fisher explained the Fisher exact test using the following example. Imagine that a lady claims she can taste the difference between cups of tea in which the tea was poured into the cup first and then milk was added, and cups of tea in which the milk was poured first and then the tea was added. A sceptic might put the lady to the test and prepare eight cups of tea – four with tea to which milk was added, and four with milk to which tea was added. (Yuck to both, by the way.) The lady is presented with these in a random order and is asked to identify those four cups with tea to which milk was added. Now, if the lady has no discriminatory ability whatever, there is only a 1-in-70 chance she identifies all four cups correctly. This is because there are 70 ways of picking four cups out of eight, and only one of these ways is correct. In Julia:
We can now imagine a slight variation on this experiment. If the lady identifies all four cups correctly, we choose to believe she has the purported discriminatory ability. If she identifies two or fewer cups correctly, we remain sceptical. But she identifies three out of four cups correctly, we prepare another eight cups of tea and give her another chance under the same conditions.
We can ask two questions about this new procedure:
- With which probability will we believe the lady if she, in fact, does not have any discriminatory ability?
- How many rounds of tea tasting will we need on average before the experiment terminates?
In the following, I’ll share both analytical and a simulation-based answers to these questions.
Analytical solution
Under the null hypothesis of no discriminatory ability, the number of correctly identified cups in any one draw () follows a hypergeometric distribution with parameters (total), (successes) and (draws), i.e., [ X (8, 4, 4). ] In any given round, the subject fails the test if she only identifies 0, 1 or 2 cups correctly. Under the null hypothesis, the probability of this happening is given by , the value of which we can determine using the cumulative mass function of the Hypergeometric(8, 4, 4) distribution. We suspend judgement on the subject’s discriminatory abilities if she identifies exactly three cups correctly, in which case she has another go. Under the null hypothesis, the probability of this happening in any given round is given by , the value of which can be determined using the probability mass function of the Hypergeometric(8, 4, 4) distribution.
With those probabilities in hand, we can now compute the probability that the subject fails the experiment in a specific round, assuming the null hypothesis is correct. In the first round, she will fail the experiment with probability . In order to fail in the second round, she needed to have advanced to the second round, which happens with probability , and then fail in that round, which happens with probability . That is, she will fail in the second round with probability . To fail in the third round, she needed to advance to the third round, which happens with probability and then fail in the third round, which happens with probability . That is, she will fail in the third round with probability . Etc. etc. The probability that she will fail somewhere in the experiment if the null hypothesis is true, then, is given by where the first equality is just a matter of shifting the index and the second equality holds because the expression is a geometric series.
Let’s compute the final results using Julia. The following loads the DataFrames
and Distributions
packages and then defines d
as the Hypergeometric(8, 4, 4) distribution. Note that in Julia, the parameters for the Hypergeometric distribution aren’t (total), (successes) and (draws), but rather (successes), (failures) and (draws); see the documentation. We then read off the values for and from the cumulative mass function and probability mass function, respectively:
using Distributions
d = Hypergeometric(4, 4, 4);
p = cdf(d, 2);
q = pdf(d, 3);
The probability that the subject will fail the experiment if she does indeed not have the purported discriminatory ability is now easily computed:
The next question is how many rounds we expect the experiment to carry on for if the null hypothesis is true. At each round, the probability that the experiment terminates in that round is given by . From the geometric distribution, we know that we then on average need attempts before the first terminating event occurs:
In sum, if the subject lacks any discriminatory ability, there is only a 1.85% chance that she will pass the test, and on average, the experiment will run for 1.3 rounds.
Brute-force solution
First, we define a function experiment()
that runs the experiment once. In essence, we have an urn
that contains four correct identifications (true
) and four incorrect identifications (false
). From this urn
, we sample()
four identifications without replacement.
Note, incidentally, that Julia functions can take both positional arguments and keyword arguments. In the sample()
command below, both urn
and 4
are passed as positional arguments, and you’d have to read the documentation to figure out which argument specifies what. The keyword arguments are separated from the positional arguments by a semi-colon and are identified with the parameter’s name.
Next, we count the number of true
s in our pick
using sum()
. Depending on how many true
s there are in pick
, we terminate the experiment, returning false
if we remain sceptic and true
if we choose to believe the lady, or we run the experiment for one more round. The number of attempts are tallied and output as well.
function experiment(attempt = 1)
urn = [false, false, false, false, true, true, true, true]
pick = sample(urn, 4; replace = false)
number = sum(pick)
if number <= 2
return false, attempt
elseif number >= 4
return true, attempt
else
return experiment(attempt + 1)
end
end;
A single run of experiment()
could produce the following output:
Next, we write a function simulate()
that runs the experiment()
a large number of times, and outputs both whether each experiment()
led to us believe the lady or remaining sceptical, and how many round each experiment()
took. These results are tabulated in a DataFrame
– just because. Of note, Julia supports list comprehension that Python users will be familiar with. I use this feature here both the run the experiment a large number of times as well as to parse the output.
using DataFrames
function simulate(runs = 10000)
results = [experiment() for _ in 1:runs]
success = [results[i][1] for i in 1:runs]
attempts = [results[i][2] for i in 1:runs]
d = DataFrame(Successful = success, Attempts = attempts)
return d
end;
Let’s swing for the fences and run this experiment a million times. Like in Python, we can make large numbers easier to parse by inserting underscores in them:
Using the @time
macro, we can check how long it take for this simulation to finish.
0.359740 seconds (4.07 M allocations: 361.334 MiB, 14.82% gc time, 35.07% compilation time)
1000000×2 DataFrame
999975 rows omitted
1 |
false |
1 |
2 |
false |
1 |
3 |
false |
1 |
4 |
false |
1 |
5 |
false |
1 |
6 |
false |
2 |
7 |
false |
3 |
8 |
false |
2 |
9 |
false |
1 |
10 |
false |
1 |
11 |
false |
1 |
12 |
false |
1 |
13 |
false |
2 |
⋮ |
⋮ |
⋮ |
999989 |
false |
1 |
999990 |
false |
1 |
999991 |
false |
1 |
999992 |
false |
4 |
999993 |
false |
1 |
999994 |
false |
1 |
999995 |
false |
1 |
999996 |
false |
1 |
999997 |
false |
1 |
999998 |
false |
1 |
999999 |
false |
1 |
1000000 |
false |
1 |
On my machine then, running this simulation takes less than a second. Note that 60% of this time is compilation time. (Update: When migrating my blog to Quarto, I reran this code using a new Julia version (1.9.1). Now the code runs faster.) Indeed, if we run the function another time, i.e., after it’s been compiled, the run time drops to about 0.3 seconds (Update: 0.2 seconds now.):
@time d2 = simulate(runs)
0.209087 seconds (3.89 M allocations: 348.982 MiB, 16.29% gc time)
1000000×2 DataFrame
999975 rows omitted
1 |
false |
1 |
2 |
false |
1 |
3 |
false |
1 |
4 |
false |
1 |
5 |
false |
1 |
6 |
false |
2 |
7 |
false |
2 |
8 |
false |
1 |
9 |
false |
1 |
10 |
false |
1 |
11 |
false |
2 |
12 |
false |
1 |
13 |
false |
1 |
⋮ |
⋮ |
⋮ |
999989 |
false |
3 |
999990 |
false |
3 |
999991 |
false |
1 |
999992 |
false |
1 |
999993 |
false |
1 |
999994 |
false |
1 |
999995 |
false |
1 |
999996 |
false |
1 |
999997 |
false |
1 |
999998 |
false |
2 |
999999 |
false |
1 |
1000000 |
false |
1 |
Using describe()
, we see that this simulation – which doesn’t ‘know’ anything about hypergeometric and geometric distributions, produces the same answers that we arrived at by analytical means: There’s a 1.86% chance that we end up believing the lady even if she has no discriminatory ability whatsoever. And if she doesn’t have any discriminatory ability, we’ll need 1.3 rounds on average before terminating the experiment:
1 |
Successful |
0.018533 |
false |
0.0 |
true |
0 |
Bool |
2 |
Attempts |
1.29611 |
1 |
1.0 |
10 |
0 |
Int64 |
The slight discrepancy between the simulation-based results and the analytical ones are just due to randomness. Below is a quick way for constructing 95% confidence intervals around both of our simulation-based estimates, and the analytical solutions fall within both intervals.
means = mean.(eachcol(d))
2-element Vector{Float64}:
0.018533
1.296105
ses = std.(eachcol(d)) / sqrt(runs)
2-element Vector{Float64}:
0.00013486862533794173
0.0006198767726106645
2-element Vector{Float64}:
0.018797342505662368
1.297319958474317
2-element Vector{Float64}:
0.018268657494337634
1.2948900415256832