Throughout the course of your career, you may need to quantify the effect of changes you make to a product or service. These changes can be the introduction of a new design to a website, the creation of a new medicine, or even choosing one football player to take a penalty kick for your team over another in the World Cup final. The way we quantify these changes is usually via A/B experiments.
For a website, this entails splitting traffic into two or more variants. Visitors in the A variant see the website’s original design, while those in the B variant see the new design. There are two popular approaches to measuring the differences between these variants statistically: the frequentist approach and the Bayesian approach.
Traditionally, the frequentist approach prevails. In this approach, you treat the traffic in each variant as a sample of all the traffic you will get in your website’s lifetime. Since we are dealing with samples, the sample means can be equal or can differ purely by chance, so we must use statistical tests (such as Student’s t-test), confidence intervals, and p-values to figure out whether this is the case. The difference is statistically significant if we can show that it is unlikely to be due to chance alone. If you are taking a frequentist approach, you have to wait until you collect a certain amount of data before you can determine whether a change is statistically significant.
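To make the frequentist workflow concrete, here is a minimal sketch. The conversion numbers are made up for illustration, and it uses SciPy (an extra dependency beyond the libraries listed later) to run a two-sample t-test on the conversion indicators:

# A minimal frequentist sketch with made-up conversion data (0 = no conversion, 1 = conversion)
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
variant_a = rng.binomial(1, 0.10, size=5000)  # hypothetical 10% conversion rate
variant_b = rng.binomial(1, 0.11, size=5000)  # hypothetical 11% conversion rate

# Two-sample t-test on the conversion indicators
t_stat, p_value = stats.ttest_ind(variant_a, variant_b)
print('p-value:', round(p_value, 4))
# If the p-value is below a pre-chosen threshold (commonly 0.05),
# we call the difference statistically significant.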
On the other hand, if we use a Bayesian approach, we must think more in terms of beliefs. The probabilities we see in our samples are interpreted as beliefs about how different the two variants are. Beliefs can be initialized using our prior knowledge of our website visitors and updated using data. Rather than discussing whether the difference between the two samples is significant or not, we quantify our beliefs about how different the two samples are.
The key advantage of the Bayesian approach is that it is more intuitive and easy to explain to your stakeholders. You can communicate to them — from day one — the belief you have reached about your samples’ differences. You can then update your belief whenever new data is available, and it is up to you and your stakeholders to decide when you have enough belief in the difference. Additionally, you can make use of prior knowledge you have about your website visitors and plug it into your analysis to reach a quicker conclusion.
The Bayesian approach was not very popular in the past because it involves computationally expensive calculations. However, thanks to the powerful computers at our disposal today, those calculations are much easier to execute than they were 30 years ago.
To run the code in this post, you need to have Python installed. I used Python 3 here, but I believe earlier versions should work too. You also need to install NumPy, Pandas, and Matplotlib.
Before we get into how to conduct a Bayesian analysis, it is important to understand probability distributions and what they are used for. As an example, take a soccer match where one team has to make a penalty kick. Player A has made eight kicks before and scored five times, so his accuracy is 62.5%. Meanwhile, Player B has made 85 kicks and scored 50 times, so his accuracy is 58.8%. Is it fair to confidently say that Player A is more accurate than Player B?
Before you answer, ask yourself if you’re equally confident about the numbers in these two scenarios. Since you saw Player B take many more penalty kicks than Player A, you know much more about Player B’s skill than Player A’s. As we will see in the next section, the beta distribution is useful in this case. Not only does it give us rates, it also shows us how confident we are in those rates.
The beta distribution takes two parameters, alpha (the number of successful penalty kicks) and beta (the number of missed penalty kicks). Of course, we could look up the beta distribution’s equation, substitute the alpha and beta parameters, and plot the density. However, there is an easier way: we can ask our computer to generate random samples from the beta distribution using the aforementioned parameters. We can then use those samples to find out anything we want about the distribution, or even compare two distributions to each other. This is called a Monte Carlo method: given a distribution, we can use NumPy to generate thousands of samples from it and then compare the two players’ distributions.
Think of it this way: We are asking Player A to make eight kicks, and then another eight, and then another, all the way up to 10,000 sets of eight kicks. Similarly, for Player B, we are asking him to make 10,000 sets of 85 kicks. Then, we’ll compare the distributions of their penalty kick accuracies. Since we cannot ask the players to make such a huge number of penalty kicks in real life, we will use a beta distribution parameterized by each player’s scored and missed kicks.
import numpy as np
import pandas as pd
from numpy.random import beta

# How many random samples to generate
N = 10000

# We define alpha and beta for the two distributions
success_a = 5
failure_a = 3
success_b = 50
failure_b = 35

# Generate N samples using the predefined alphas and betas
a_samples = beta(success_a, failure_a, N)
b_samples = beta(success_b, failure_b, N)

# Convert to a Pandas Series for better handling later
a_samples = pd.Series(a_samples)
b_samples = pd.Series(b_samples)

# Plot the two distributions using kernel density estimation
pd.DataFrame(
    {
        'Player A': a_samples,
        'Player B': b_samples,
    }
).plot(
    kind='kde',
    title='Beta Distribution',
    xlim=(0, 1),
)
Now we have a better understanding of how these two players perform. According to the distribution means, Player A is slightly better than Player B on average. However, Player B’s distribution is narrower, which means we are fairly confident that his accuracy lies somewhere above 40% and below 80%, while Player A’s accuracy is all over the place (it could be 20% or even close to 100%; we’re not sure). Still, Player A’s accuracy is more likely to be above 60%, since more of his probability mass lies there. As a coach, you still have a tough choice to make about which player should take the kick, but at least you have a better understanding of your data and how confident you can be in it.
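We can also summarize these samples numerically. Here is a minimal sketch using the samples generated above (exact values will vary from run to run, since the samples are random):

# Compare the two players using the samples generated above
print('Player A mean accuracy:', round(a_samples.mean(), 3))
print('Player B mean accuracy:', round(b_samples.mean(), 3))
# How often a Player A sample beats a Player B sample,
# i.e. our confidence that Player A is the more accurate one
print('P(A > B):', round((a_samples > b_samples).mean(), 3))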
Here are some examples of beta distributions and probability density functions:
When both alpha and beta equal one, we end up with a uniform distribution. Imagine that a player makes two penalty kicks, scores one, and misses the other. In this instance, it’s hard to be certain about his accuracy. However, when another player takes 10 penalty kicks, scores five, and misses five, we are more certain about the 50% accuracy, even though it is still 50-50 like the first case. This is why the purple probability density function is higher around 50%. The blue one is higher, narrower, and more certain still, because we’re working with even more data: 30 penalty kicks.
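Here is a minimal sketch that reproduces curves like these. I am assuming the three cases are one of two, five of ten, and fifteen of thirty kicks scored; the exact numbers behind the original figure may differ, but the point about certainty growing with data is the same:

import pandas as pd
from numpy.random import beta

N = 10000
pd.DataFrame(
    {
        '1 of 2 kicks': pd.Series(beta(1, 1, N)),      # uniform: alpha = beta = 1
        '5 of 10 kicks': pd.Series(beta(5, 5, N)),
        '15 of 30 kicks': pd.Series(beta(15, 15, N)),
    }
).plot(
    kind='kde',
    title='Beta Distributions with Different Amounts of Data',
    xlim=(0, 1),
)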
Another distribution we need to have a look at is the binomial distribution. It answers a different question. Say we know that a player has a penalty kick accuracy of 80%. If he shoots 10 penalty kicks in a row, what are the chances that he will score only one out of all of them? What about the chances of only scoring two or three, and so on? Obviously, he cannot score 2.5 or 3.14 kicks, so the binomial is a discrete distribution.
One might argue that if his accuracy is 80%, then he should score eight out of 10. While the probabilities should indeed peak at eight, there is still a chance that he scores only seven, or even nine. It is hard to recover the real accuracy from a low number of kicks.
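To make this concrete, here is a quick check of the exact binomial probabilities for a player with 80% accuracy taking 10 kicks, using Python’s built-in math.comb (available from Python 3.8):

from math import comb

n, p = 10, 0.8
for k in (7, 8, 9, 10):
    prob = comb(n, k) * p**k * (1 - p)**(n - k)
    print(f'P(score {k} of {n}) = {prob:.3f}')
# The probabilities peak at 8 (about 0.30),
# but scoring 7 or 9 is still quite likely.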
The graph below shows the probability mass functions for two players with accuracies of 20% and 80%, each taking 10 penalty kicks. Player A’s probabilities peak at two, as expected, while Player B’s peak at eight, but there is still a chance that each of them scores a bit more or less than those numbers out of the 10 kicks.
Since this is a discrete distribution, we represent it with a probability mass function instead of a probability density function. As you can see, the graph is displayed as discrete bars, and, since probabilities must add up to one, the bars’ heights for each player sum to one.
If the same Player A is allowed to take 100 kicks instead of 10, he will have a greater chance of showing his real (in)accuracy. In other words, we will have more certainty around his 20% accuracy once we’ve seen 20 scored kicks out of 100, compared to two out of 10, as demonstrated in the next figure.
# Python Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from numpy.random import binomial

# How many random samples to generate
N = 10000

# Binomial distribution parameters
n_a = 10
p_a = 0.2
n_b = 100
p_b = 0.2

# Generate N samples using the predefined n and p
a_samples = binomial(n_a, p_a, N)
b_samples = binomial(n_b, p_b, N)

# Convert to a Pandas Series for better handling later
a_samples = pd.Series(a_samples)
b_samples = pd.Series(b_samples)

# We get some values around 2 in the A samples
# and some values around 20 in the B samples.
# We want to calculate how often each value appears.
a_freq = a_samples.value_counts()
b_freq = b_samples.value_counts()

# For better visualisation, we pad the missing
# frequencies and set them to zero
a_freq = a_freq.reindex(range(11), fill_value=0)
b_freq = b_freq.reindex(range(41), fill_value=0)

# Since we are calculating probabilities here,
# we divide by the total number of samples (N)
a_hist = a_freq / N
b_hist = b_freq / N

fig, axs = plt.subplots(1, 2, figsize=(16, 6))

# We already calculated our histogram,
# so we use a bar chart here instead.
pd.DataFrame(
    {
        'Player A; P=20% & n=10': a_hist,
    }
).plot(
    kind='bar',
    title='Binomial Distribution (P=20%; n=10)',
    xlim=(0, 10),
    ax=axs[0],
)
pd.DataFrame(
    {
        'Player B; P=20% & n=100': b_hist,
    }
).plot(
    kind='bar',
    title='Binomial Distribution (P=20%; n=100)',
    xlim=(0, 50),
    ax=axs[1],
)
plt.show()
Armed with our knowledge of the beta and binomial distributions, we can decide which player we should choose to take a penalty kick. This same methodology is what you would use to decide which changes to a website are more likely to lead to higher conversions, or which medicine is more likely to cure more patients. What we need to estimate is the true accuracy of a player given the data we have about him. Remember, the player cannot take an infinite number of penalty kicks; he can only take a few dozen, if we are lucky. Consequently, what we are after is an estimate of the player’s accuracy.
Let’s call Player A’s accuracy θ_A and Player B’s accuracy θ_B. What we want to compare is P(θ_A | Data_A) versus P(θ_B | Data_B), where the data are the kicks we have seen each player score and miss. The pipe “|” here means given, or conditional on, the data.
We can calculate these two probabilities for the two players thanks to Bayes’ rule:
P(θ | Data) = P(Data | θ) * P(θ) / P(Data)
Since we are comparing two players, and P(Data) is not a function of either of them (it is the same for both), we can get rid of it in our comparison:
P(θ | Data) := P(Data | θ) * P(θ)
I am using := instead of the proportionality sign (∝) for better readability.
In this case, P(θ) is our prior belief about a player’s accuracy, which we express as a beta distribution, and P(Data | θ) is the likelihood of seeing the player’s scored and missed kicks given that accuracy, which is a binomial distribution.
Now, we must multiply the prior (beta distribution) by the likelihood (binomial distribution). The product is called the posterior. It’s important to note, however, an interesting fact about these two particular distributions: they are conjugate, which means the product has a simple closed form. The product of a beta prior with parameters (alpha, beta) and a binomial likelihood, where a player with accuracy θ was seen scoring K kicks out of N, is another beta distribution with parameters (alpha + K, beta + N - K).
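As a quick sanity check of this update rule, here is a tiny sketch reusing Player A’s numbers from earlier (5 scored out of 8 kicks) with a uniform Beta(1, 1) prior:

# Conjugate update: Beta(alpha, beta) prior + Binomial (K scored of N kicks) likelihood
prior_alpha, prior_beta = 1, 1   # uniform prior
K, N = 5, 8                      # Player A scored 5 of 8 kicks
posterior_alpha = prior_alpha + K          # 6
posterior_beta = prior_beta + (N - K)      # 4
print('Posterior: Beta({}, {})'.format(posterior_alpha, posterior_beta))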
How can I speak about soccer nowadays without mentioning Lionel Messi and Cristiano Ronaldo? Let’s compare these two players’ penalty kick accuracy. It’s always a good idea to validate your data, but for the sake of this example, I found some statistics about these players on an obscure website. According to that website, Messi scored 81 penalty kicks and missed 24, while Ronaldo scored 109 and missed 23. Knowing these numbers, we can easily calculate and plot the posterior distributions for the two players and compare them.
# Python Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from numpy.random import beta

# Our prior beta distribution parameters
prior_alpha = 1
prior_beta = 1

# How many kicks were scored
successes_messi = 81
successes_ronaldo = 109

# How many kicks were not scored
fails_messi = 24
fails_ronaldo = 23

# Calculate Posterior := Likelihood * Prior
messi_samples = beta(
    a=prior_alpha + successes_messi,
    b=prior_beta + fails_messi,
    size=10000,
)
ronaldo_samples = beta(
    a=prior_alpha + successes_ronaldo,
    b=prior_beta + fails_ronaldo,
    size=10000,
)

fig, ax = plt.subplots(1, 1, figsize=(12, 8))
pd.DataFrame(
    {
        'Posterior Messi': messi_samples,
        'Posterior Ronaldo': ronaldo_samples,
    }
).plot(
    kind='hist',
    title='Messi vs Ronaldo Penalty Shootout Accuracy',
    bins=100,
    alpha=0.5,
    ax=ax,
)

# Add lines to show the medians of each distribution
ax.vlines(
    x=np.median(messi_samples), ymin=0, ymax=500, color='red', alpha=0.75,
)
ax.vlines(
    x=np.median(ronaldo_samples), ymin=0, ymax=500, color='blue', alpha=0.75,
)
plt.show()
At this point, you might be wondering about the size parameter used above, and why I called the posterior a distribution. Imagine that we are able to clone people, and we made 10,000 clones each of Messi and Ronaldo; hence, size is set to 10,000. Each of these clones is drawn from the corresponding posterior beta distribution. What we are comparing now are the distributions of those 20,000 clones.
We can see that there is overlap between the two distributions, but what is more important is quantifying the difference between them. Remember, we are comparing one Messi clone to one Ronaldo clone at a time, and repeating this process for 10,000 pairs of clones. We want to figure out how often one clone is better than the other, and what the distribution of those differences looks like. Thus, we are going to subtract Messi’s posterior samples from Ronaldo’s and plot the differences. We can also draw a vertical red line at zero; the probability mass to the right of the red line represents how often in our samples Ronaldo is more accurate. Or, in Bayesian terms, it represents our confidence that Ronaldo is more accurate. The black line is at the median of the distribution of differences; it shows the expected difference in accuracy between Ronaldo and Messi.
# Calculate the difference between the two samples
samples_diff = pd.Series(ronaldo_samples) - pd.Series(messi_samples)

# Find the median of the difference
samples_diff_median = samples_diff.median()

fig, ax = plt.subplots(1, 1, figsize=(12, 8))

# Plot the samples difference
samples_diff.plot(
    kind='hist',
    title='Diff (Ronaldo - Messi)',
    bins=100,
    color='Gray',
    alpha=0.75,
    ax=ax,
)
ax.set_xlabel("Expected Difference in Rates")

# Draw a black line at the median of the difference
diff_vline_min, diff_vline_max = ax.get_ylim()
ax.plot(
    (samples_diff_median, samples_diff_median),
    (diff_vline_min, diff_vline_max),
    color='black',
    alpha=0.5,
)

# Draw a red line at zero
ax.plot(
    (0, 0),
    (diff_vline_min, diff_vline_max),
    color='red',
    alpha=0.5,
)
plt.show()
There is a much bigger probability mass to the right of the red line, which shows more confidence in Ronaldo having better penalty shootout skills. We can quantify the difference between the two players’ accuracies, and the confidence in that difference, as follows:
print(
    'Confidence that Ronaldo > Messi: {}%'.format(
        (
            100 * (samples_diff > 0).mean()
        ).round(2)
    )
)
print(
    'Difference (Ronaldo - Messi) Median {}%'.format(
        round(100 * samples_diff_median, 2)
    )
)
# Confidence that Ronaldo > Messi: 85.58%
# Difference (Ronaldo - Messi) Median 5.39%
The outcome of a penalty kick is binary: you either score or you don’t. What about something like darts? You throw a dart and can get a different score depending on where it hits the board. In a case like this, our favorite beta and binomial distributions won’t work. Thankfully, there are two other distributions that fit this scenario: the categorical distribution is the multi-outcome counterpart of the binomial, and the Dirichlet is the multi-outcome version of the beta. And yes, you guessed it, the Dirichlet and categorical distributions are also conjugates. It is fairly easy to adapt the code above to these new distributions and figure out how to parameterize the posterior Dirichlet distribution with the prior parameters and the observed counts.
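As a minimal sketch of what that could look like, suppose the three board regions a dart can land in are worth 1, 2, or 3 points (the regions, point values, and counts below are made up for illustration). With a uniform Dirichlet(1, 1, 1) prior, the posterior parameters are simply the prior plus the observed counts per outcome:

import numpy as np
from numpy.random import dirichlet

N = 10000
points = np.array([1, 2, 3])            # hypothetical scores for three board regions

# Hypothetical observed counts of each outcome for two players
counts_a = np.array([20, 15, 5])
counts_b = np.array([30, 40, 10])

prior = np.array([1, 1, 1])             # uniform Dirichlet prior

# Posterior Dirichlet := prior parameters + observed counts
a_samples = dirichlet(prior + counts_a, N)   # shape (N, 3)
b_samples = dirichlet(prior + counts_b, N)

# Expected points per throw for each sampled probability vector
a_expected = a_samples @ points
b_expected = b_samples @ points

print('Confidence that B > A: {}%'.format(round(100 * (b_expected > a_expected).mean(), 2)))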
It’s clear that Bayesian analysis has its perks. It is easier to interpret and more informative than traditional p-values. You can actually see the difference between the two treatments you are comparing, and you’re able to quantify the confidence you have in that difference. You can also use a more informative prior if you want to include your prior belief in a certain outcome. There are plenty of Python libraries to help you calculate and plot your analysis. I used NumPy here, but PyMC3 is a good, and more advanced, alternative.
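For reference, here is a rough sketch of what the Messi posterior from above might look like in PyMC3. The variable names are mine, and I am assuming a recent PyMC3 version; treat it as a starting point rather than a definitive implementation:

import pymc3 as pm

# Messi: 81 scored out of 105 kicks in total (81 scored + 24 missed)
with pm.Model():
    accuracy = pm.Beta('accuracy', alpha=1, beta=1)            # uniform prior
    kicks = pm.Binomial('kicks', n=105, p=accuracy, observed=81)
    trace = pm.sample(2000, tune=1000)

print(pm.summary(trace))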
As we have seen, the most important part of this work is to think clearly about what you are measuring and to find the proper distributions to represent your prior knowledge and the likelihood of an event. We have touched on both binary and categorical events in this post, but I would be happy to hear from readers about the different kinds of events they test and which distributions they find most suitable for them. Wikipedia has a helpful list of conjugate distributions you can look at for inspiration.
I wish you the best of luck analyzing your next experiment the Bayesian way!
Tarek Amr is based in the Netherlands, where he works as a machine learning scientist. His book, Practical D3.js, provides step-by-step guidance to mastering the efficient use of D3.js in data visualization projects. Tarek is passionate about open data initiatives and is happy to respond to questions and comments on LinkedIn.