ABC of A/B Testing
Introduction
After my last post on random number generation, I want to add one more thing that Chris Rackauckas pointed out to me as relevant to complete the picture. If you are working with pseudorandom number generators, it is good to know that Julia has an excellent RandomNumbers.jl package that offers a wide array of different random number generators to choose from. So in this post I will use the xoroshiro128** generator.
The post was tested under Julia 1.5.3 and packages DataFrames.jl 0.22.1, Distributions.jl 0.23.8, FreqTables.jl 0.4.2, Pipe.jl 1.3.0, RandomNumbers.jl 1.4.0, PyPlot.jl 2.9.9 and intends to present a pretty complete work-flow of investigating a sample data science problem.
A/B testing
In this post we will investigate several simple ways in which A/B tests can be performed.
In our scenario we assume that we have to choose from \(n\) alternatives. Each alternative \(i\in[n]\) has an unknown probability of success \(p_i\). Assume that e.g. we have \(n\) different ways to promote a product to a customer and \(p_i\) is a probability that customer buys after seeing a promotion.
Initially we do not know which of the options is the best, i.e. has the highest \(p_i\), but we are allowed to present them to customers and collect their responses to learn the unknown probabilities.
The simplest way to run such an experiment is to present each option to the customer with the same probability \(1/n\). Such an approach focuses purely on exploration of the alternatives. However, intuitively we feel that it is wasteful. If after some trials we see that some option is very bad and has only a marginal chance of being the best, then it is not worth showing it to the customer. Not only do we lose money (we present something that we know is not very good), but we also do not learn much that helps us choose the best option.
Another approach would be to always show the customer the option that we currently believe is best. Such a strategy focuses purely on exploitation. Similarly to the uniform sampling described above, it does not have to lead to good outcomes. We can get stuck with a sub-optimal solution, as we do not give ourselves a chance to learn that some alternative that is currently not considered the best is actually superior.
So how can we balance exploration with exploitation? There are many approaches that can be considered, but one of the simplest, yet very effective, is Thompson sampling. In this approach we constantly keep information about the distribution of our beliefs about the quality of each alternative. Denote the random variable corresponding to the distribution of these beliefs for alternative \(i\) as \(P_i\). In each step we independently sample a value from each distribution \(P_i\) and choose to measure the option that has the highest sampled value. Why does this heuristic work? We will tend to sample the alternatives for which our beliefs take high values (so we do not waste time on sampling very bad options), but at the same time we will from time to time sample options for which the distribution \(P_i\) has high variance (that is, options about which we do not know much; even if at the moment of sampling we believe they are not so good, our beliefs tell us that they could turn out to be good if we collected more evidence on them).
In the codes below I compare these three strategies of running A/B tests. I call them respectively:
- Uniform: sample each alternative with probability \(1/n\);
- Greedy: sample the alternative with the highest \(E(P_i)\);
- Thompson: sample the alternative with the highest value of a single independent sample from \(P_i\).
Before we turn to coding, one last thing should be discussed: how should we represent our beliefs \(P_i\)? Fortunately, here Bayesian statistics comes in handy. As our experiments are Bernoulli trials, we know that if we represent the prior beliefs as Beta distributed then the posterior beliefs are also going to follow the Beta distribution. Formally, we say that the Beta distribution is a conjugate prior of the Bernoulli distribution. If our beliefs have a \(Beta(\alpha, \beta)\) distribution and we see a success, then the distribution of beliefs becomes \(Beta(\alpha+1, \beta)\). On the other hand, if we see a failure, it becomes \(Beta(\alpha, \beta+1)\). For example, a uniform \(Beta(1, 1)\) prior that has seen three successes and one failure becomes \(Beta(4, 2)\). As you can see, this is very easy to implement.
Setting up the stage
In a fresh Julia session we first load the required packages:
using DataFrames
using Distributions
using FreqTables
using Pipe
using RandomNumbers
using PyPlot
using LinearAlgebra
Next define a function that updates the beliefs about the quality of some alternative:
update_belief(b::Beta, r::Bool) = Beta(b.α + r, b.β + !r)
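To see that this matches the conjugate update described above, here is a quick illustrative check on a uniform prior (the exact printed form may differ slightly between Distributions.jl versions):
julia> update_belief(Beta(1, 1), true)
Beta{Float64}(α=2.0, β=1.0)

julia> update_belief(Beta(2, 1), false)
Beta{Float64}(α=2.0, β=2.0)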
Finally define the three decision rules that we want to compare: uniform, greedy, and Thompson:
abstract type ABRule end
struct Greedy <: ABRule end
struct Thompson <: ABRule end
struct Unif <: ABRule end
function decide(options::Vector{<:Beta}, ::Unif)
return rand(axes(options, 1))
end
function decide(options::Vector{<:Beta}, ::Greedy)
m = mean.(options)
mm = maximum(m)
return rand(findall(==(mm), m))
end
function decide(options::Vector{<:Beta}, ::Thompson)
m = rand.(options)
mm = maximum(m)
return rand(findall(==(mm), m))
end
One technical aspect of our implementation of the decision rules for greedy and Thompson sampling is that we have to take into consideration that more than one option can have the same evaluation. That is why we first find the maximum evaluation with mm = maximum(m) and then randomly select one of the options that attain it with rand(findall(==(mm), m)).
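To illustrate the tie-breaking, here is a small self-contained example with made-up evaluations (the last call returns either 1 or 3, so its output is suppressed):
julia> m = [0.5, 0.2, 0.5];

julia> findall(==(maximum(m)), m)
2-element Array{Int64,1}:
 1
 3

julia> rand(findall(==(maximum(m)), m)); # randomly picks index 1 or 3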
Now we are ready to run the experiment.
Implementing the experimenter
Here is an implementation of the function that runs the experiments:
function coupled_run!(n::Int, steps::Int, rng::AbstractRNG,
                      results::DataFrame, runid::Int)
    # unknown true success probabilities and the index of the best alternative
    truep = rand(rng, n)
    bestp = argmax(truep)
    # pre-generate the trial outcomes so that all decision rules share them
    streams = [rand(rng, Bernoulli(p), steps) for p in truep]
    for alg in (Greedy(), Unif(), Thompson())
        # start from uniform beliefs about every alternative
        options = [Beta(1, 1) for i in 1:n]
        # position of the next unused outcome in each stream
        loc = fill(1, n)
        for i in 1:steps
            d = decide(options, alg)
            options[d] = update_belief(options[d], streams[d][loc[d]])
            loc[d] += 1
        end
        push!(results, (runid=runid, alg=string(typeof(alg)),
                        correct=mean(options[bestp]) == maximum(mean.(options)),
                        payoff=sum(x -> x.α - 1, options)))
    end
end
The coupled_run! function takes five arguments: n is the number of alternatives to consider, steps is the length of the A/B test experiment, rng is the random number generator we will use (remember that we want to test the RandomNumbers.jl library), results is a data frame in which we will accumulate the experiment results, and finally runid is the experiment number.
Let us comment on several crucial parts of the above code. Note that we sample the unknown true probabilities of success in the experiment with truep = rand(rng, n). As they are uniformly distributed, we set our initial beliefs about the options to also follow a uniform distribution in the line options = [Beta(1, 1) for i in 1:n].
The next notable thing is that we want the experiments for all algorithms we consider to be coupled, that is: each alternative should produce the same sequence of successes and failures under every decision rule. Therefore, for simplicity, we pre-populate the results of the trials in the line streams = [rand(rng, Bernoulli(p), steps) for p in truep] and then just track in the loc vector which random number should be picked next for each option. Such an approach reduces noise when comparing the decision rules. (Note that we could have implemented this more efficiently, but I did not want to over-complicate the code.)
Finally note that we are collecting two pieces of information:
- correct: whether the option we believe to be best at the end of the experiment is indeed the best one;
- payoff: how much payoff we have collected while running the experiment (since every success increments the α parameter of the chosen option by one, sum(x -> x.α - 1, options) is simply the total number of successes observed).
Both of them are desirable. Everyone probably agrees that it would be best to have correct near one (so that after running the experiment we are confident that we know which option is best). At the same time it is also desirable to ensure that the experiment itself is not wasteful, so that we collect as large a payoff as possible.
We will compare the three considered A/B testing rules using these two criteria.
Analyzing the experiment results
First run the experiment (it should take under 20 seconds):
julia> results = DataFrame()
0×0 DataFrame
julia> rng = Xorshifts.Xoroshiro128StarStar(1234)
RandomNumbers.Xorshifts.Xoroshiro128StarStar(0x9e929e92000004d2, 0xda409a400013489a)
julia> @time foreach(runid -> coupled_run!(10, 500, rng, results, runid), 1:20_000)
18.736959 seconds (143.71 M allocations: 8.944 GiB, 6.78% gc time)
julia> results
60000×4 DataFrame
Row │ runid alg correct payoff
│ Int64 String Bool Float64
───────┼───────────────────────────────────
1 │ 1 Greedy false 451.0
2 │ 1 Unif true 354.0
3 │ 1 Thompson true 486.0
4 │ 2 Greedy true 468.0
5 │ 2 Unif true 264.0
⋮ │ ⋮ ⋮ ⋮ ⋮
59997 │ 19999 Thompson true 464.0
59998 │ 20000 Greedy false 472.0
59999 │ 20000 Unif true 394.0
60000 │ 20000 Thompson true 462.0
59991 rows omitted
We have run the sampling 20,000 times in under 20 seconds. We have 60,000 rows in our DataFrame, as each run produced three rows, one per decision rule we consider. In each run we considered 10 alternatives and a budget of 500 samples for testing.
First we collect some summary statistics about the decision rules:
julia> @pipe groupby(results, :alg) |>
combine(_, [:correct, :payoff] .=> mean)
3×3 DataFrame
Row │ alg correct_mean payoff_mean
│ String Float64 Float64
─────┼─────────────────────────────────────
1 │ Greedy 0.38145 405.997
2 │ Unif 0.8031 250.394
3 │ Thompson 0.87955 431.31
We see that Thompson sampling has both the best probability of being correct and the best cumulative payoff. Interestingly, the greedy strategy has a low probability of being correct but a relatively high payoff. The opposite holds for the uniform sampling strategy.
Now we investigate the probability of correct selection in more detail:
julia> correct_wide = unstack(results, :runid, :alg, :correct)
20000×4 DataFrame
Row │ runid Greedy Unif Thompson
│ Int64 Bool? Bool? Bool?
───────┼────────────────────────────────
1 │ 1 false true true
2 │ 2 true true true
3 │ 3 false true true
4 │ 4 false true true
5 │ 5 false false false
⋮ │ ⋮ ⋮ ⋮ ⋮
19997 │ 19997 false true true
19998 │ 19998 false false true
19999 │ 19999 false true true
20000 │ 20000 false true true
19991 rows omitted
julia> proptable(correct_wide, :Greedy, :Unif)
2×2 Named Array{Float64,2}
Greedy ╲ Unif │ false true
──────────────┼─────────────────
false │ 0.14145 0.4771
true │ 0.05545 0.326
julia> proptable(correct_wide, :Greedy, :Thompson)
2×2 Named Array{Float64,2}
Greedy ╲ Thompson │ false true
──────────────────┼─────────────────
false │ 0.0907 0.52785
true │ 0.02975 0.3517
julia> proptable(correct_wide, :Unif, :Thompson)
2×2 Named Array{Float64,2}
Unif ╲ Thompson │ false true
────────────────┼─────────────────
false │ 0.0893 0.1076
true │ 0.03115 0.77195
Observe that Thompson and uniform sampling give quite similar results most of the time, with Thompson sampling having a slight edge when they disagree. The greedy strategy is much worse. It is easy to understand these findings: the greedy strategy does not do enough exploration, while both Thompson and uniform sampling do explore. The edge of Thompson sampling is that it does not waste time testing very bad alternatives (so it can focus on the good ones and better discriminate between them).
Similarly we can investigate which alternative gives the best payoff (remember that we made sure to have the scenarios coupled):
julia> function find_best(alg, payoff)
best_idxs = findall(==(maximum(payoff)), payoff)
best_algs = alg[best_idxs]
sort!(best_algs)
return join(best_algs, '|')
end
find_best (generic function with 1 method)
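As a quick check of how ties are reported by this function, here is an illustrative call with made-up payoffs:
julia> find_best(["Thompson", "Greedy"], [450.0, 450.0])
"Greedy|Thompson"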
julia> alg_freq = @pipe groupby(results, :runid) |>
combine(_, [:alg, :payoff] => find_best => :best) |>
groupby(_, :best, sort=true) |>
combine(_, nrow)
3×2 DataFrame
Row │ best nrow
│ String Int64
─────┼────────────────────────
1 │ Greedy 9736
2 │ Greedy|Thompson 120
3 │ Thompson 10144
We observe that across the 20,000 simulations uniform sampling was never the best, while Thompson sampling has a slight edge over greedy sampling in the probability of being best (there is also a small probability of a tie).
As an exercise, let us check, sticking to the Bayesian approach, what the probability is that Thompson sampling has a higher probability of being best than the greedy strategy. Here we use the fact that the Dirichlet distribution is a conjugate prior of the Categorical distribution, and set uniform prior beliefs:
julia> posterior = Dirichlet(alg_freq.nrow .+ 1)
Dirichlet{Float64}(alpha=[9737.0, 121.0, 10145.0])
julia> posterior_dis = rand(rng, posterior, 100_000)
3×100000 Array{Float64,2}:
0.484744 0.486453 0.492359 … 0.494059 0.484432 0.482332
0.00632059 0.00692795 0.00563844 0.0076129 0.00706873 0.00782947
0.508936 0.506619 0.502002 0.498328 0.508499 0.509838
julia> mean(x -> x[1] < x[3], eachcol(posterior_dis))
0.99707
We can see that given the sample size of 20,000 this probability is greater than 99%.
However, we see that the considered fractions actually do not differ that much:
julia> transform(alg_freq, :nrow => (x -> x ./ sum(x)) => :freq)
3×3 DataFrame
Row │ best nrow freq
│ String Int64 Float64
─────┼─────────────────────────────────
1 │ Greedy 9736 0.4868
2 │ Greedy|Thompson 120 0.006
3 │ Thompson 10144 0.5072
and the difference in average payoffs we observed above was relatively larger. Let us dig into this. First we plot the distribution of the difference between the payoffs of Thompson and greedy sampling:
payoff_wide = unstack(results, :runid, :alg, :payoff)
Δpayoff = payoff_wide.Thompson .- payoff_wide.Greedy
hist(Δpayoff, 25)
xlabel("Δpayoff: Thompson-Greedy")
getting the following plot:
Now we see what is going on. When Thompson sampling is worse than greedy sampling, it is worse only slightly. On the other hand, greedy sampling can be much worse than Thompson sampling (and we understand when: this is the scenario in which the greedy algorithm gets stuck at some sub-optimal alternative).
We also quickly check the distribution of the estimator of the difference in payoffs using Bayesian bootstrapping (to keep to the style we use in this post): we draw observation weights from a flat Dirichlet distribution and for each draw compute the weighted average of Δpayoff:
julia> dd = Dirichlet(length(Δpayoff), 1);
julia> boot = [dot(Δpayoff, rand(rng, dd)) for _ in 1:10_000];
julia> quantile(boot, 0:0.25:1)
5-element Array{Float64,1}:
23.427965439713827
24.64825057737705
24.911299035010433
25.17170629935465
26.513561155964837
and we see that, as expected, the distribution of the difference lies well above zero (we could probably easily detect this significance even with a much smaller number of tests than 20,000).
Conclusion
That was a long post, so let me finish at this point. I hope you enjoyed it and could also appreciate the data wrangling capabilities that the JuliaData ecosystem offers. I also encourage you to run some more experiments on your own to check how the considered A/B testing strategies behave under different parameterizations.
And the usual caveat – performance geeks (are there any in the Julia community?) can find several spots in my code that I left sub-optimal to keep it simple. Feel free to experiment with improving the execution speed of my examples.