Introduction
After my last post on random number generation, I think it is good to add one
more thing that Chris Rackauckas pointed out to me as relevant to showing a
complete picture. If you are working with pseudo-random number generators, it
is good to know that Julia has the excellent RandomNumbers.jl package that
offers a wide array of different random number generators to choose from.
So in this post I will use the xoroshiro128** generator.
The post was tested under Julia 1.5.3 with DataFrames.jl 0.22.1,
Distributions.jl 0.23.8, FreqTables.jl 0.4.2, Pipe.jl 1.3.0,
RandomNumbers.jl 1.4.0, and PyPlot.jl 2.9.9. It intends to present a fairly
complete workflow of investigating a sample data science problem.
A/B testing
In this post we will investigate several simple ways in which A/B tests
can be performed.
In our scenario we assume that we have to choose from \(n\) alternatives.
Each alternative \(i\in[n]\) has an unknown probability of success \(p_i\).
Assume, for example, that we have \(n\) different ways to promote a product to
a customer and \(p_i\) is the probability that the customer buys after seeing
promotion \(i\).
Initially we do not know which of the options is the best, i.e. has the highest
\(p_i\), but we are allowed to present them to customers and collect their
responses to learn the unknown probabilities.
The simplest way to run such an experiment is to present each option to the
customer with the same probability \(1/n\). Such an approach focuses purely on
exploration of the alternatives. However, intuitively we feel that it is
wasteful. If after some trials we see that an option is very bad and has
marginal chances of being the best, then it is not worth showing it to the
customer. We are not only losing money (we present something that we know is not
very good), but additionally we will not learn much in terms of choosing the
best option.
Another approach would be to always show our customer an option that we
currently believe is best. Such a strategy focuses purely on exploitation.
Similarly to the uniform sampling described above, we can see that it does not
have to lead to good outcomes. We can get stuck with a sub-optimal solution, as
we do not give ourselves a chance to learn that some alternative that is
currently not considered the best is actually superior.
So how can we balance exploration with exploitation? There are many
approaches that can be considered, but one of the simplest, yet very effective,
is Thompson sampling. In this approach we constantly
keep information about the distribution of our beliefs about the quality of each
alternative. Denote the random variable corresponding to the distribution of
these beliefs for alternative \(i\) as \(P_i\). In each step we
independently sample a value from each distribution \(P_i\) and choose to
measure the option that has the highest sampled value. Why does this
heuristic work? Note that we will tend to sample the alternatives about which
our beliefs are high (so we do not waste time on sampling very bad options),
but at the same time we will also, from time to time, sample options whose
\(P_i\) distribution has a high variance (i.e. the options about which we do not
know much; even if at the moment of sampling we believe they are not so good,
our beliefs indicate that they could turn out to be good if we collect more
evidence on them).
In the code below I compare these three strategies of running A/B tests. I call
them respectively:
- Uniform: sample each alternative with probability \(1/n\);
- Greedy: sample the alternative with the highest \(E(P_i)\);
- Thompson: sample the alternative with the highest value of a single
independent sample from \(P_i\).
Before we turn to coding, one last thing should be discussed. How should we
represent our beliefs \(P_i\)? Fortunately, here Bayesian statistics comes in
handy. As our experiments are Bernoulli trials, we know that if we represent the
prior beliefs as Beta distributed then the posterior beliefs are also
going to follow the same family of distributions. Formally, we say that the Beta
distribution is a conjugate prior of the Bernoulli distribution. If our beliefs
have a \(Beta(\alpha, \beta)\) distribution and we see a success, then the
distribution of beliefs becomes \(Beta(\alpha+1, \beta)\). On the other hand, if
we see a failure it becomes \(Beta(\alpha, \beta+1)\). As you can see, it is
very easy to implement.
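For example, starting from the uniform prior \(Beta(1, 1)\), observing three
successes followed by one failure yields \(Beta(4, 2)\), whose mean
\(\alpha/(\alpha+\beta) = 4/6 \approx 0.67\) already leans towards the observed
success rate of \(3/4\).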
Setting up the stage
In a fresh Julia session we first load the required packages:
using DataFrames
using Distributions
using FreqTables
using Pipe
using RandomNumbers
using PyPlot
using LinearAlgebra
Next define a function that updates the beliefs about the quality of some
alternative:
update_belief(b::Beta, r::Bool) = Beta(b.α + r, b.β + !r) # a success increments α, a failure increments β
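As a quick sanity check (the exact printed form of a Beta may differ slightly
between Distributions.jl versions):
julia> update_belief(Beta(1, 1), true)
Beta{Float64}(α=2.0, β=1.0)
julia> update_belief(Beta(2, 1), false)
Beta{Float64}(α=2.0, β=2.0)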
Finally define the three decision rules that we want to compare: uniform,
greedy, and Thompson:
abstract type ABRule end
struct Greedy <: ABRule end
struct Thompson <: ABRule end
struct Unif <: ABRule end
function decide(options::Vector{<:Beta}, ::Unif)
    return rand(axes(options, 1))
end

function decide(options::Vector{<:Beta}, ::Greedy)
    m = mean.(options)
    mm = maximum(m)
    return rand(findall(==(mm), m))
end

function decide(options::Vector{<:Beta}, ::Thompson)
    m = rand.(options)
    mm = maximum(m)
    return rand(findall(==(mm), m))
end
One technical aspect of our implementation of the decision rules for greedy
and Thompson sampling is that we have to take into consideration that more than
one option can have the same evaluation. That is why we first find the maximum
evaluation with mm = maximum(m) and then randomly select one of the options
attaining it with rand(findall(==(mm), m)).
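As a quick illustration (a hypothetical mini-example, not part of the
experiment below), consider two options where the first has accumulated much
more evidence of success:
opts = [Beta(8, 2), Beta(2, 2)]           # means 0.8 and 0.5
decide(opts, Greedy())                    # always returns 1 (unique maximum mean)
[decide(opts, Thompson()) for _ in 1:10]  # mostly 1s, but occasionally a 2
Here the greedy rule deterministically picks option 1, while Thompson sampling
usually does too, yet still explores option 2 from time to time.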
Now we are ready to run the experiment.
Implementing the experimenter
Here is an implementation of the function that runs the experiments:
function coupled_run!(n::Int, steps::Int, rng::AbstractRNG,
                      results::DataFrame, runid::Int)
    truep = rand(rng, n)      # unknown true success probabilities
    bestp = argmax(truep)     # index of the truly best option
    streams = [rand(rng, Bernoulli(p), steps) for p in truep]
    for alg in (Greedy(), Unif(), Thompson())
        options = [Beta(1, 1) for i in 1:n]  # uniform prior beliefs
        loc = fill(1, n)      # position of the next unused trial in each stream
        for i in 1:steps
            d = decide(options, alg)
            options[d] = update_belief(options[d], streams[d][loc[d]])
            loc[d] += 1
        end
        push!(results, (runid=runid, alg=string(typeof(alg)),
                        correct=mean(options[bestp]) == maximum(mean.(options)),
                        payoff=sum(x -> x.α - 1, options)))
    end
end
The coupled_run! function takes five arguments: n is the number of alternatives
to consider, steps is the length of the A/B test experiment, rng is the random
number generator we will use (remember that we want to test the
RandomNumbers.jl library), results is a data frame in which we will store the
experiment results, and finally runid is the experiment number.
Let us comment on several crucial parts of the above code. Note that we sample
the unknown true probabilities of success in the experiment with
truep = rand(rng, n). As they are uniformly distributed, we set our initial
beliefs about the options to also follow a uniform distribution in this line:
options = [Beta(1, 1) for i in 1:n].
The next notable thing is that we want the experiments for all algorithms we
consider to be coupled, that is: each alternative should generate the same
sequence of successes and failures under every algorithm. Therefore, for
simplicity, we pre-populate the results of the trials in the line
streams = [rand(rng, Bernoulli(p), steps) for p in truep] and then just track
in the loc vector which random number should be picked next for each option.
Such an approach reduces noise when comparing the algorithms. (Note that we
could have implemented it more efficiently, but I did not want to
over-complicate the code.)
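To make this coupling concrete, here is a minimal sketch with hand-written
(hypothetical) trial outcomes in place of the rand(rng, Bernoulli(p), steps)
draws:
streams = [[true, false, true], [false, false, true]]  # pre-drawn trials for two options
loc = [1, 1]            # position of the next unused trial of each option
d = 2                   # suppose the decision rule picked option 2
r = streams[d][loc[d]]  # observe false, the first pre-drawn trial of option 2
loc[d] += 1             # the next query of option 2 will return streams[2][2]
Because every algorithm consumes the same streams, the differences in results
come from the decision rules themselves and not from luck in the draws.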
Finally, note that we are collecting two pieces of information:
- correct: whether the option we believe is best at the end of the experiment is indeed the best;
- payoff: how much payoff we have collected while running the experiment (since each success increments the α parameter of some belief by one, sum(x -> x.α - 1, options) counts the total number of successes).
Both of them are desirable. Everyone probably agrees that it would
be best to have correct near one (so that after running the experiment
we are confident that we know which option is best). At the same time it is also
desirable to ensure that the experiment itself is not wasteful, so that we
collect as high a payoff as possible.
We will compare the three considered A/B testing rules using these two criteria.
Analyzing the experiment results
First run the experiment (it should take under 20 seconds):
julia> results = DataFrame()
0×0 DataFrame
julia> rng = Xorshifts.Xoroshiro128StarStar(1234)
RandomNumbers.Xorshifts.Xoroshiro128StarStar(0x9e929e92000004d2, 0xda409a400013489a)
julia> @time foreach(runid -> coupled_run!(10, 500, rng, results, runid), 1:20_000)
18.736959 seconds (143.71 M allocations: 8.944 GiB, 6.78% gc time)
julia> results
60000×4 DataFrame
Row │ runid alg correct payoff
│ Int64 String Bool Float64
───────┼───────────────────────────────────
1 │ 1 Greedy false 451.0
2 │ 1 Unif true 354.0
3 │ 1 Thompson true 486.0
4 │ 2 Greedy true 468.0
5 │ 2 Unif true 264.0
⋮ │ ⋮ ⋮ ⋮ ⋮
59997 │ 19999 Thompson true 464.0
59998 │ 20000 Greedy false 472.0
59999 │ 20000 Unif true 394.0
60000 │ 20000 Thompson true 462.0
59991 rows omitted
We have run the sampling 20,000 times in under 20 seconds. We have 60,000 rows
in our DataFrame, as for each run we produced three rows corresponding
to the three decision rules we compare. In each experiment we considered 10
alternatives and a budget of 500 samples for testing.
First we collect some summary statistics about the decision rules:
julia> @pipe groupby(results, :alg) |>
             combine(_, [:correct, :payoff] .=> mean)
3×3 DataFrame
Row │ alg correct_mean payoff_mean
│ String Float64 Float64
─────┼─────────────────────────────────────
1 │ Greedy 0.38145 405.997
2 │ Unif 0.8031 250.394
3 │ Thompson 0.87955 431.31
We see that Thompson sampling has both the best probability of being correct and
the best cumulative payoff. Interestingly, the greedy strategy has a low
probability of being correct, but a relatively high payoff. The opposite holds
for the uniform sampling strategy.
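As a quick back-of-the-envelope check, note that with 20,000 runs the Monte
Carlo standard error of each estimated proportion is at most
\(\sqrt{0.25/20000} \approx 0.0035\), so the differences we see between the
strategies are far larger than the simulation noise.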
Now we investigate the probability of correct selection in more detail:
julia> correct_wide = unstack(results, :runid, :alg, :correct)
20000×4 DataFrame
Row │ runid Greedy Unif Thompson
│ Int64 Bool? Bool? Bool?
───────┼────────────────────────────────
1 │ 1 false true true
2 │ 2 true true true
3 │ 3 false true true
4 │ 4 false true true
5 │ 5 false false false
⋮ │ ⋮ ⋮ ⋮ ⋮
19997 │ 19997 false true true
19998 │ 19998 false false true
19999 │ 19999 false true true
20000 │ 20000 false true true
19991 rows omitted
julia> proptable(correct_wide, :Greedy, :Unif)
2×2 Named Array{Float64,2}
Greedy ╲ Unif │ false true
──────────────┼─────────────────
false │ 0.14145 0.4771
true │ 0.05545 0.326
julia> proptable(correct_wide, :Greedy, :Thompson)
2×2 Named Array{Float64,2}
Greedy ╲ Thompson │ false true
──────────────────┼─────────────────
false │ 0.0907 0.52785
true │ 0.02975 0.3517
julia> proptable(correct_wide, :Unif, :Thompson)
2×2 Named Array{Float64,2}
Unif ╲ Thompson │ false true
────────────────┼─────────────────
false │ 0.0893 0.1076
true │ 0.03115 0.77195
Observe that Thompson and uniform sampling give quite similar results most of
the time, with Thompson sampling having a slight edge when they disagree.
The greedy strategy is much worse. It is easy to understand these findings. The
greedy strategy does not do enough exploration. Both Thompson and uniform
sampling do explore. The edge of Thompson sampling is that it does not waste
time testing very bad alternatives (so it can focus on the good ones and better
discriminate between them).
Similarly, we can investigate which strategy gives the best payoff (remember
that we made sure to have the scenarios coupled):
julia> function find_best(alg, payoff)
           best_idxs = findall(==(maximum(payoff)), payoff)
           best_algs = alg[best_idxs]
           sort!(best_algs)
           return join(best_algs, '|')
       end
find_best (generic function with 1 method)
julia> alg_freq = @pipe groupby(results, :runid) |>
             combine(_, [:alg, :payoff] => find_best => :best) |>
             groupby(_, :best, sort=true) |>
             combine(_, nrow)
3×2 DataFrame
Row │ best nrow
│ String Int64
─────┼────────────────────────
1 │ Greedy 9736
2 │ Greedy|Thompson 120
3 │ Thompson 10144
We observe that over the 20,000 simulations uniform sampling was never the
best, while Thompson sampling has a slight edge over the greedy strategy in the
probability of being best (there is also a small probability of a tie).
As an exercise, let us check, sticking to the Bayesian approach, what the
probability is that Thompson sampling has a higher probability of being best
than the greedy strategy. Here we use the fact that the Dirichlet distribution
is a conjugate prior for the Categorical distribution, and set uniform prior
beliefs:
julia> posterior = Dirichlet(alg_freq.nrow .+ 1)
Dirichlet{Float64}(alpha=[9737.0, 121.0, 10145.0])
julia> posterior_dis = rand(rng, posterior, 100_000)
3×100000 Array{Float64,2}:
0.484744 0.486453 0.492359 … 0.494059 0.484432 0.482332
0.00632059 0.00692795 0.00563844 0.0076129 0.00706873 0.00782947
0.508936 0.506619 0.502002 0.498328 0.508499 0.509838
julia> mean(x -> x[1] < x[3], eachcol(posterior_dis))
0.99707
We can see that given the sample size of 20,000 this probability is greater than
99%.
However, we see that the considered fractions actually do not differ that much:
julia> transform(alg_freq, :nrow => (x -> x ./ sum(x)) => :freq)
3×3 DataFrame
Row │ best nrow freq
│ String Int64 Float64
─────┼─────────────────────────────────
1 │ Greedy 9736 0.4868
2 │ Greedy|Thompson 120 0.006
3 │ Thompson 10144 0.5072
while the difference in average payoffs we observed above was relatively larger.
Let us dig into this. First we plot a distribution of difference between payoffs
with Thompson and greedy sampling:
payoff_wide = unstack(results, :runid, :alg, :payoff)
Δpayoff = payoff_wide.Thompson .- payoff_wide.Greedy
hist(Δpayoff, 25)
xlabel("Δpayoff: Thompson-Greedy")
getting the following plot:
[Figure: histogram of Δpayoff: Thompson − Greedy]
Now we see what is going on. If Thompson sampling is worse than greedy
sampling, it is worse only slightly. On the other hand, greedy sampling can be
much worse than Thompson sampling (and we understand when: this is the scenario
in which the greedy algorithm gets stuck at some sub-optimal alternative).
We also quickly check the distribution of the estimator of the difference in
payoffs using the Bayesian bootstrap (to keep to the style we use in this post):
julia> dd = Dirichlet(length(Δpayoff), 1);  # uniform Dirichlet weights over the runs
julia> boot = [dot(Δpayoff, rand(rng, dd)) for _ in 1:10_000];  # each draw is a weighted mean of Δpayoff
julia> quantile(boot, 0:0.25:1)
5-element Array{Float64,1}:
23.427965439713827
24.64825057737705
24.911299035010433
25.17170629935465
26.513561155964837
and we see that, as expected, the distribution of the difference is very
significantly greater than zero (we could probably easily detect this
significance even with a much smaller number of runs than 20,000).
Conclusion
That was a long post, so let me finish at this point. I hope you enjoyed it and
could also appreciate the data wrangling capabilities that the JuliaData
ecosystem offers you. I also encourage you to run some more experiments on your
own to check how the considered A/B testing strategies behave under different
parameterizations.
And the usual caveat: performance geeks (are there any in the Julia community?)
can find several spots in my code that I left sub-optimal to keep it simple.
Feel free to experiment with improving the execution speed of my examples.