**AB Testing Using Pyro**

Consider an organisation that has designed a new website landing page and wants to know the impact this may have on conversion, i.e. do visitors continue their web session on the website after landing on the page? In test group A, website visitors will be shown the current landing page. In test group B, website visitors will be shown the new landing page. In the remainder of the article, I’ll refer to test group A as the control group, and group B as the treatment group. The business is sceptical about the change and has opted for an 80/20 split in session traffic. The total number of visitors and the total number of page conversions for each test group are as follows: the control group received 5,523 visitors, of which 2,926 converted; the treatment group received 1,379 visitors, of which 759 converted.

The null hypothesis of the AB test is that there will be no change in page conversion between the two test groups. Under the frequentist framework, this would be expressed as follows for a two-sided test, where r_c and r_t are the page conversion rates in the control and treatment groups, respectively.

$$H_0: r_c = r_t \qquad H_1: r_c \neq r_t$$

A significance test would then seek to either reject or fail to reject the null hypothesis. Under the Bayesian framework, we express the null hypothesis slightly differently, by asserting the same *prior* for each of the test groups.
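For comparison, the frequentist significance test mentioned above can be sketched as a two-proportion z-test. This is only an illustrative sketch: the function name `two_proportion_z_test` is my own, the pooled-variance z-test is one common choice among several, and the counts are the visitor and conversion figures that the article's code uses.

```python
import math

def two_proportion_z_test(conv_c, n_c, conv_t, n_t):
    """Two-sided z-test for H0: r_c = r_t, using a pooled standard error."""
    r_c, r_t = conv_c / n_c, conv_t / n_t
    pooled = (conv_c + conv_t) / (n_c + n_t)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (r_t - r_c) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

z, p_value = two_proportion_z_test(conv_c=2926, n_c=5523, conv_t=759, n_t=1379)
print(f"z = {z:.3f}, two-sided p-value = {p_value:.3f}")
```

The test returns a single p-value, whereas the Bayesian approach in the rest of the article returns a full distribution over each conversion rate.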

Let’s pause and outline exactly what is happening during our test. The variable we’re interested in is the page conversion rate. This is simply calculated by taking the number of distinct converted visitors over the total number of visitors. The event that generates this rate is whether the visitor clicks through the page. There are only two possible outcomes here for each visitor: either the visitor clicks through the page and converts, or doesn’t. Some of you might recognise that for each distinct visitor, this is an example of a Bernoulli trial; there is one trial and two possible outcomes. Now, when we collect a set of these Bernoulli trials, we have a binomial distribution. When the random variable X has a binomial distribution, we give it the following notation:

$$X \sim B(n, p)$$

Where n is the number of visitors (or the number of Bernoulli trials), and p is the probability of the event on each trial. p is what we’re interested in here: we want to know the probability of a visitor converting on the page in each test group. We have observed some data, but as mentioned in the previous section, we first need to define our prior. As always in Bayesian statistics, we need to define this prior as a probability distribution. As mentioned before, this probability distribution is a characterisation of our uncertainty. Beta distributions are commonly used for modelling probabilities, because they are defined on the interval [0,1]. Moreover, using a beta distribution as our prior for a binomial likelihood function gives us the helpful property of conjugacy, which means our posterior will belong to the same distribution family as our prior. We say that the beta distribution is a *conjugate* prior. A beta distribution is defined by two parameters, alpha and, confusingly, beta.
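Conjugacy means the posterior can be written down directly: a Beta(alpha, beta) prior combined with k conversions out of n visitors gives a Beta(alpha + k, beta + n - k) posterior. The sketch below illustrates that update. The function name `conjugate_update` is hypothetical, and the Beta(1,1) prior and control-group counts are assumptions taken from later in the article.

```python
def conjugate_update(alpha, beta, visitors, conversions):
    """Beta(alpha, beta) prior + binomial data -> Beta posterior parameters."""
    return alpha + conversions, beta + (visitors - conversions)

# Assumed inputs: Beta(1, 1) prior and the control-group counts
post_alpha, post_beta = conjugate_update(1, 1, visitors=5523, conversions=2926)

# The mean of a Beta(a, b) distribution is a / (a + b)
posterior_mean = post_alpha / (post_alpha + post_beta)
print(f"Posterior: Beta({post_alpha}, {post_beta}), mean = {posterior_mean:.4f}")
```

This closed form makes a useful sanity check for the sampled posterior that the MCMC approach produces.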

With access to historical data, we can assert an informed prior. We don’t necessarily need historical data; we could use our intuition to inform our understanding, but for now let’s assume we have neither (later in this tutorial we will use informed priors, but to demonstrate the impact, I’ll start with the uninformed). Let’s assume we have no understanding of the conversion rate on the company’s site, and therefore define our prior as Beta(1,1). This is called a flat prior. The probability density of this distribution looks like the graph below, the same as a uniform distribution defined on the interval [0,1]. By asserting a Beta(1,1) prior, we are saying that all possible values of the page conversion rate are equally probable.
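To see why Beta(1,1) is flat, we can evaluate the beta density by hand. The following is a minimal sketch using only the standard library; `beta_pdf` is a hypothetical helper written for this illustration, not part of Pyro.

```python
import math

def beta_pdf(x, alpha, beta):
    """Density of a Beta(alpha, beta) distribution at x, for 0 < x < 1."""
    normaliser = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    return normaliser * x ** (alpha - 1) * (1 - x) ** (beta - 1)

# With alpha = beta = 1 both exponents are zero, so the density is constant
for x in (0.1, 0.5, 0.9):
    print(x, beta_pdf(x, 1, 1))
```

Passing other parameters, for example `beta_pdf(0.5, 2, 2)`, shows how larger values of alpha and beta concentrate the prior around particular conversion rates.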

We now have all the information we need: the priors and the data. Let’s jump into the code. The code provided here offers a framework to get started with AB testing using Pyro; it therefore neglects some features of the package. To help optimise your code further and take full advantage of Pyro’s capabilities, I recommend referring to the official documentation.

First, we need to import our packages. The final line is good practice, particularly when working in notebooks, as it clears the store of parameters we have built up.

```python
import pyro
import pyro.distributions as dist
from pyro.infer import NUTS, MCMC
import torch
from torch import tensor
import matplotlib.pyplot as plt
import seaborn as sns
from functools import partial
import pandas as pd

# Clear any parameters left over from earlier runs (useful in notebooks)
pyro.clear_param_store()
```

Models in Pyro are defined as regular Python functions. This is useful because it makes them intuitive to follow.

```python
def model(beta_alpha, beta_beta):
    def _model_(traffic: torch.Tensor, number_of_conversions: torch.Tensor):
        # Define stochastic primitives: one conversion-rate prior per test group
        prior_c = pyro.sample('prior_c', dist.Beta(beta_alpha, beta_beta))
        prior_t = pyro.sample('prior_t', dist.Beta(beta_alpha, beta_beta))
        priors = torch.stack([prior_c, prior_t])
        # Define the observed stochastic primitives
        with pyro.plate('data'):
            observations = pyro.sample('obs', dist.Binomial(traffic, priors),
                                       obs=number_of_conversions)
    return partial(_model_)
```

A few things to break down and explain here. First, we have a function wrapped inside an outer function; the outer function returns the partial function of the inner function. This allows us to change our priors without having to change the code. I have referred to the variables defined in the inner function as primitives; think of primitives as variables in the model. We have two types of primitives in the model, stochastic and observed stochastic. In Pyro, we don’t need to explicitly define the difference; we simply add the obs argument to the sample method when it is an observed primitive and Pyro interprets it accordingly. Observed primitives are contained within the context manager pyro.plate(), which is best practice and makes our code look cleaner. Our stochastic primitives are our two priors, characterised by Beta distributions, governed by the alpha and beta parameters that we pass in from the outer function. As previously mentioned, we assert the null hypothesis by defining these as equal. We then stack these two primitives together using torch.stack(), which performs an operation akin to concatenating NumPy arrays. This returns a tensor, the data structure required for inference in Pyro. We have now defined our model, so let’s move on to the inference stage.
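The outer/inner function pattern can be illustrated without Pyro. In this minimal sketch, `make_prior_description` and `_describe_` are hypothetical stand-ins for the article's `model` and `_model_`. The point is only that the inner function remembers the prior parameters it was created with.

```python
from functools import partial

def make_prior_description(beta_alpha, beta_beta):
    """Hypothetical stand-in for the outer `model` function."""
    def _describe_(group):
        # The inner function captures beta_alpha and beta_beta (a closure)
        return f"{group} ~ Beta({beta_alpha}, {beta_beta})"
    return partial(_describe_)

flat = make_prior_description(1, 1)
informed = make_prior_description(2, 2)

print(flat("prior_c"))
print(informed("prior_c"))
```

Because no arguments are bound, `partial(_describe_)` behaves the same as returning `_describe_` itself. The closure is what carries the prior parameters.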

As previously mentioned, this tutorial will use MCMC. The function below takes as parameters the model we defined above and the number of samples we want to use to generate our posterior distribution. We also pass our data into the function, as we did for the model.

```python
def run_infernce(model, number_of_samples, traffic, number_of_conversions):
    kernel = NUTS(model)
    mcmc = MCMC(kernel, num_samples=number_of_samples, warmup_steps=200)
    mcmc.run(traffic, number_of_conversions)
    return mcmc
```

The first line inside this function defines our kernel. We use the NUTS class to define our kernel, which stands for No-U-Turn Sampler, an autotuning version of Hamiltonian Monte Carlo. This tells Pyro how to sample from the posterior probability space. Again, it is beyond the scope of this article to dive deeper into this topic, but for now, it is sufficient to know that NUTS allows us to sample from the probability space intelligently. The kernel is then used to initialise the MCMC class on the second line, specifying that it should use NUTS. We pass the number_of_samples argument to the MCMC class, which is the number of samples used to generate the posterior distribution. We assign the initialised MCMC class to the mcmc variable and call the run() method, passing our data as parameters. The function returns the mcmc variable.

That is all we need; the following code defines our data and calls the functions we have just made using the Beta(1,1) prior.
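To build intuition for what an MCMC sampler does, here is a minimal sketch of random-walk Metropolis. This is a much simpler algorithm than NUTS and is not what Pyro runs here; it is swapped in purely for illustration. The names `log_posterior`, `metropolis` and `step_size` are my own, and the Beta(1,1) prior and control-group counts are assumed.

```python
import math
import random

def log_posterior(p, visitors, conversions, alpha=1.0, beta=1.0):
    """Unnormalised log posterior: Beta(alpha, beta) prior, binomial likelihood."""
    if not 0.0 < p < 1.0:
        return -math.inf
    return ((conversions + alpha - 1.0) * math.log(p)
            + (visitors - conversions + beta - 1.0) * math.log(1.0 - p))

def metropolis(visitors, conversions, num_samples=5000, warmup_steps=500,
               step_size=0.01, seed=0):
    rng = random.Random(seed)
    current = 0.5
    current_lp = log_posterior(current, visitors, conversions)
    samples = []
    for step in range(num_samples + warmup_steps):
        # Propose a small random move away from the current value
        proposal = current + rng.gauss(0.0, step_size)
        proposal_lp = log_posterior(proposal, visitors, conversions)
        # Accept with probability min(1, posterior ratio)
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        # Discard the warm-up steps, keep the rest
        if step >= warmup_steps:
            samples.append(current)
    return samples

control_samples = metropolis(visitors=5523, conversions=2926)
print(sum(control_samples) / len(control_samples))
```

NUTS follows the same accept-and-keep idea, but uses gradient information to propose moves far more efficiently, which is why it is preferred in practice.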

```python
traffic = torch.tensor([5523., 1379.])
conversions = torch.tensor([2926., 759.])

inference = run_infernce(model(1, 1), number_of_samples=1000,
                         traffic=traffic, number_of_conversions=conversions)
```

The first element of the traffic and conversions tensors is the count for the control group, and the second element in each tensor is the count for the treatment group. We pass the model function, with the parameters that govern our prior distribution, alongside the tensors we have defined. Running this code will generate our posterior samples. We run the following code to extract the posterior samples and pass them to a Pandas dataframe.

```python
posterior_samples = inference.get_samples()
posterior_samples_df = pd.DataFrame(posterior_samples)
```

Notice the column names of this dataframe are the strings we passed when we defined our primitives in the model function. Each row in our dataframe contains samples drawn from the posterior distribution, and each of these samples represents an estimate of the page conversion rate, the probability value p that governs our binomial distribution. Now that we have the samples, we can plot our posterior distributions.

**Results**

An insightful way to visualise the results of the AB test with two test groups is with a joint kernel density plot. It allows us to visualise the density of samples in the probability space across both distributions. The graph below can be produced from the dataframe we have just built.

The probability space contained in the graph above can be divided across its diagonal: anything above the line indicates regions where the estimate of the conversion rate is higher in the treatment group than in the control, and vice versa. As illustrated in the plot, the samples drawn from the posterior are densely populated in the region that indicates the conversion rate is higher in the treatment group. It is important to highlight that the posterior distribution for the treatment group is wider than that of the control group, reflecting a higher degree of uncertainty. This is a result of observing less data in the treatment group. Nevertheless, the plot strongly indicates that the treatment group has outperformed the control group. By collecting an array of samples from the posterior and taking the element-wise difference, we can say that the probability that the treatment group outperforms the control group is 90.4%. This figure means that 90.4% of the samples drawn from the posterior lie above the diagonal in the joint density plot above.
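The element-wise comparison described above can be sketched as follows. To keep the example runnable without Pyro, it draws stand-in samples from the analytic Beta posteriors under a Beta(1,1) prior, which is an assumption on my part. The exact figure will differ slightly from the MCMC-based 90.4% because the draws are random.

```python
import random

rng = random.Random(0)
num_draws = 10_000

# Stand-in posterior draws (assumption): analytic Beta posteriors, Beta(1,1) prior
control = [rng.betavariate(1 + 2926, 1 + 5523 - 2926) for _ in range(num_draws)]
treatment = [rng.betavariate(1 + 759, 1 + 1379 - 759) for _ in range(num_draws)]

# Element-wise difference between paired treatment and control draws
differences = [t - c for c, t in zip(control, treatment)]

# Share of draws in which the treatment rate exceeds the control rate
prob_treatment_better = sum(d > 0 for d in differences) / num_draws
print(f"P(treatment > control) = {prob_treatment_better:.3f}")

# With the article's dataframe, the same quantity would be
# (posterior_samples_df['prior_t'] > posterior_samples_df['prior_c']).mean()
```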

These results were achieved by using a flat (uninformed) prior. Using an informed prior may help improve the model, particularly when the availability of observed data is limited. A helpful exercise is to explore the effects of using different priors. The plot below shows the Beta(2,2) probability density function and the joint plot it produces when we rerun the model. We can see that using the Beta(2,2) prior produces a very similar posterior distribution for both test groups.
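A quick way to see why the choice of prior barely matters here, but can matter when data is limited, is to compare posterior means under both priors. The sketch below relies on the closed-form Beta posterior rather than MCMC. The small dataset of 20 visitors and 14 conversions is hypothetical and included only for contrast.

```python
def posterior_mean(alpha, beta, visitors, conversions):
    """Mean of the Beta posterior under a binomial likelihood."""
    return (alpha + conversions) / (alpha + beta + visitors)

datasets = {
    "treatment group (article data)": (1379, 759),
    "small sample (hypothetical)": (20, 14),
}
priors = {"Beta(1,1)": (1, 1), "Beta(2,2)": (2, 2)}

results = {}
for data_name, (visitors, conversions) in datasets.items():
    for prior_name, (alpha, beta) in priors.items():
        mean = posterior_mean(alpha, beta, visitors, conversions)
        results[(data_name, prior_name)] = mean
        print(f"{data_name:32s} {prior_name}: {mean:.4f}")
```

With over a thousand visitors the data dominates the prior, whereas with twenty visitors the Beta(2,2) prior visibly pulls the estimate towards 0.5.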

The samples drawn from the posterior suggest there is a 91.5% probability that the treatment group performs better than the control. Therefore, we believe with a higher degree of certainty that the treatment group is better than the control, compared with using a flat prior. However, in this example the difference is negligible.

There is one other thing I would like to highlight about these results. When we ran the inference, we told Pyro to generate 1000 samples from the posterior. This is an arbitrary number; choosing a different number of samples can change the results. To highlight the effect of increasing the number of samples, I ran an AB test where the observations from the control and treatment groups were the same, each with an overall conversion rate of 50%. Using a Beta(2,2) prior generates the following posterior distributions as we incrementally increase the number of samples.

When we run our inference with just 10 samples, the posterior distributions for the control and treatment groups are relatively wide and adopt different shapes. As the number of samples that we draw increases, the distributions converge, eventually generating nearly identical distributions. Moreover, we observe two properties of statistical distributions, the central limit theorem and the law of large numbers. The central limit theorem states that the distribution of sample means converges towards a normal distribution as the number of samples increases, and we can see that in the plot above. Furthermore, the law of large numbers states that as the sample size grows, the sample mean converges towards the population mean. We can see that the mean of the distributions in the bottom right tile is roughly 0.5, the conversion rate observed in each of the test samples.
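The law of large numbers can be checked numerically. In this sketch the data (1,000 visitors and 500 conversions per group) is hypothetical, since the article does not state the counts used for this experiment. The draws come from the analytic Beta posterior under a Beta(2,2) prior rather than from Pyro.

```python
import random

rng = random.Random(0)

# Hypothetical data: 1,000 visitors, 500 conversions, Beta(2, 2) prior
alpha_post, beta_post = 2 + 500, 2 + 500

sample_means = {}
for num_samples in (10, 100, 1_000, 10_000):
    draws = [rng.betavariate(alpha_post, beta_post) for _ in range(num_samples)]
    # The mean of the draws settles towards the posterior mean of 0.5
    sample_means[num_samples] = sum(draws) / num_samples
    print(f"{num_samples:>6} samples: mean = {sample_means[num_samples]:.4f}")
```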