Introduction
About this blog post
This blog article is intended as a hands-on tutorial on how to conduct Bayesian inference. Bayesian inference is a method for learning the values of parameters in statistical models from data. Bayesian inference / data analysis is a fully probabilistic approach – its outcomes are probability distributions. Another distinctive feature of Bayesian inference is the use of prior information in the analyses. Bayesian inference is not concerned with one particular statistical model – in fact, a multitude of different statistical / ML models (be it Linear Regression, Logistic Regression, Artificial Neural Networks, …) can be “trained” in either a Bayesian or a non-Bayesian way. Bayesian inference can be summarized as being a method for learning from data.
This blog post consists of two parts. The first one gives a very brief recapitulation of the history of Bayesian data analysis, as well as a few words on the differences between Bayesian methods and the more common frequentist methods. The first part concludes with the derivation of Bayes’ Theorem, an example demonstrating the application of Bayes’ Theorem, and a short introduction to how it can be used to tackle data-analytical problems.
Python has been chosen as the programming language (R would arguably be the first alternative), and Stan (Python interface: PyStan) will be used as the tool for specifying Bayesian models and conducting the inference.
The main intention is to give someone who has not previously performed a Bayesian inference on a dataset the means to do so – and hopefully also convincing arguments for why anyone would want to.
Short history
The roots of Bayesian statistics date back to the 18th century, when Reverend Thomas Bayes discovered/invented Bayes’ Theorem. The famous French mathematician Pierre-Simon Laplace was one of the first proponents of Bayesian data analysis; he developed the methods further and popularized them.
Especially in the early 20th century, Bayesian statistics was exposed to fierce opposition. One of the most prominent opponents of Bayesian statistics – or “inverse probability” – was the highly influential statistician Ronald Fisher. This strong rejection is presumably responsible for the relative demise of Bayesian statistics in the 20th century. Furthermore, Bayesian inference for real-world models with potentially hundreds or thousands of free parameters is computationally significantly more expensive than its frequentist alternative, which almost certainly constituted another reason for the vast dominance of frequentist statistics in the past.
Apart from tailwinds due to constantly increasing computational power, the advent of computational tools for conducting Bayesian inference (among the first were BUGS [1] and JAGS [2]) has relatively recently led to an increase in the popularity of Bayesian methods.
These tools make heavy use of Markov Chain Monte Carlo (MCMC) methods [3] – methods that were largely developed by scientists involved in the Manhattan project, in the years after the project ended. Initially used to simulate physical systems, they were later adopted in statistics – for example for Bayesian inference. One of the scientists strongly involved in the invention of MCMC methods was the Polish mathematician Stanislaw Ulam – after whom the probabilistic programming language Stan [4,5] is named. PyStan [6] is Stan’s Python interface.
Bayesian methods are used in a large variety of domains and for a diverse set of problems – from modelling the spread of infectious diseases [7] to analysing astrophysical data [8].
Bayesian Statistics vs. Frequentist Statistics
Frequentist statistics (sometimes also called Fisherian statistics after the aforementioned statistician), which makes use of Maximum Likelihood methods, has been the default choice in statistical inference for large parts of the 20th century.
Given a dataset, Maximum Likelihood Estimation (MLE) starts with a model for the probability / likelihood of the data. This model will have unknown parameters. In Maximum Likelihood Estimation, optimization methods are used to obtain the parameter values that maximize the likelihood of the data. Just to stress this point – we do not obtain the parameters that are most probable given the data, but instead the parameters that make the data most likely.
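To make this concrete, here is a minimal sketch (not from the original post, with made-up measurement values) of MLE for a normal model with unknown mean and standard deviation, using NumPy and SciPy:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical data: repeated noisy measurements of some quantity
data = np.array([4.9, 5.3, 5.1, 4.7, 5.2, 5.0])

# Negative log-likelihood of a normal model; we optimize over log(sigma)
# to keep the standard deviation positive
def neg_log_likelihood(params):
    mu, log_sigma = params
    return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

# Optimization yields the point estimates that make the data most likely
result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_mle, sigma_mle = result.x[0], np.exp(result.x[1])
print(mu_mle, sigma_mle)
```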
Similar to Maximum Likelihood methods, Bayesian methods start with a model for the likelihood of the observed data. Additionally, a Bayesian model requires the specification of a probability distribution over the parameters. This probability distribution can be thought of as the “belief” in the parameter values before any data (at least the data to be analyzed in the particular analysis) has been observed. It is usually called the prior distribution or a priori distribution. Bayes’ Theorem is subsequently used to take the influence of the observed data into consideration – a distribution over the parameters before the data has been observed (the prior distribution) is transformed into a probability distribution over the parameters after the data has been observed (the posterior distribution).
The use of prior distributions can be seen as one of the strengths of Bayesian inference – for example, it provides for regularization and thus “stabilizes” statistical inference. Many approaches to regularization in MLE (such as Lasso or Ridge regression) can be understood in a meaningful way when taking the Bayesian viewpoint (see e.g. [9]). Yet, one of the most prominent arguments against Bayesian inference is that prior distributions introduce subjectivity into the analysis – the data no longer speak for themselves, but the analysis is influenced by the analyst. The usual response of Bayesians to this argument is that the choice of the details of the likelihood function already influences the analysis. The argument could be summarized as: every data analysis will be influenced by the analyst – the Bayesian approach is simply very transparent about this.
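To sketch this connection (a standard textbook identity, not something derived in this post): if the parameters w are given a zero-mean Gaussian prior, then maximizing the posterior (the so-called MAP estimate) is equivalent to minimizing the negative log-likelihood plus an L2 penalty – i.e. Ridge regression:

$\hat{w}_{\text{MAP}} = \arg\max_{w} \left[ \log p(D|w) + \log p(w) \right] = \arg\min_{w} \left[ -\log p(D|w) + \lambda \lVert w \rVert_2^2 \right], \quad \text{with } p(w) = \mathcal{N}(w\,|\,0, \sigma_w^2 I) \text{ and } \lambda = \tfrac{1}{2\sigma_w^2}$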
One of the main advantages of Bayesian data analysis is the ease of incorporating prior information. Assume that you are analysing a scientific experiment in which you want to estimate the value of some physical constant (e.g. the speed of light), and that you have already analysed data from a previous experiment. You might not want to trash the insights you gained from the first experiment, but instead take them into consideration when analysing the data of the second experiment. Bayesian statistics, and specifically the use of prior distributions, provides an easy way of achieving this – the first experiment’s posterior distribution of the parameter (the speed of light) becomes the prior distribution for the parameter in the inference on the second experiment’s data. We have learned something from the analysis of the first experiment and do not ignore this knowledge, but use it as a starting point for the interpretation of data from further analyses.
Bayesian inference thus not only provides you with results that are easier to interpret, but also gives you an easy way of combining the results of different analyses.
Bayesian Statistics – the mechanics
Bayes’ Theorem
Bayes’ Theorem deals with conditional and marginal probabilities. One of the prototypical examples for illustrating Bayes’ Theorem deals with tests for a disease and can thus easily be turned into a currently relevant example (written during the COVID-19 crisis of 2020, in case you are reading this in the far future).
Example: Testing for diseases
Let’s assume we are dealing with a test (T) for past infections (PI) with the virus that causes COVID-19. It is supposed to check whether an individual person has already overcome an infection with SARS-CoV-2.
Individual people are taking this test and their result is either negative (T=0) or positive (T=1). At the same time, people either already have successfully fought the virus (PI = 1) or have not been infected yet (PI=0). For simplicity’s sake, let’s ignore the edge case of people who are currently infected.
Sensitivity and Specificity
The quality of the test can be summarized by its sensitivity – how many of the people who have actually been infected get a positive test result – and its specificity. The specificity tells you what percentage of the people who have not had the virus get a negative test result. Let’s assume that the test has a sensitivity of 98% and a specificity of 90%.
Sensitivity and specificity can be expressed as conditional probabilities. The sensitivity is the probability that a test result is positive given that the tested individual has previously been infected:
$p(\text{T}=1|\text{PI}=1) = 0.98$
The specificity is the conditional probability that a test-result is negative, given that the person has not been previously infected:
$p(\text{T}=0|\text{PI}=0) = 0.9$
It is obvious that a good test should have both – a high sensitivity and a high specificity. It is, for example, very easy to achieve a high sensitivity alone (just design a test that always returns a positive result). This test would have a perfect sensitivity, since it would give a positive result to everyone who has had the virus. Yet, its specificity would be abysmal, since it would also give everyone who has not had the virus a positive test result.
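For concreteness, here is how sensitivity and specificity would be computed from a (completely made-up) confusion matrix of test results:

```python
# Hypothetical counts for a test evaluated on people with known infection status
true_positives = 98    # previously infected, test positive
false_negatives = 2    # previously infected, test negative
true_negatives = 900   # never infected, test negative
false_positives = 100  # never infected, test positive

sensitivity = true_positives / (true_positives + false_negatives)  # 0.98
specificity = true_negatives / (true_negatives + false_positives)  # 0.90
print(sensitivity, specificity)
```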
Now, the answer we might really be interested in is the probability with which a person who has a positive test result has actually been infected. It is in a sense the “inverse” of the probability of having a positive test result given that someone has been infected. Remember, Bayesian analysis has also been called “inverse probability”.
So, we are interested in the quantity p(PI=1|T=1). As you may have guessed – Bayes’ Theorem will help us in obtaining the desired answer.
Derivation of Bayes’ Theorem
In general, any joint probability distribution of two random variables (such as PI and T) can be factored in two ways. One way is:
$p(\text{T},\text{PI}) = p(\text{T}|\text{PI})p(\text{PI})$
Just to clarify this further, let’s look at the case of a positive test result (T=1) for a previously infected person (PI=1):
$p(\text{T}=1,\text{PI}=1) = p(\text{T}=1|\text{PI}=1)p(\text{PI}=1)$
The probability that a person has a positive test result and has had the virus can be expressed as the probability that a person has had the virus (p(PI=1)) times the probability that a test result was positive given that the person has had the virus (p(T=1|PI=1)).
The joint probability distribution p(T,PI) can also be factored in another way:
$p(\text{T},\text{PI}) = p(\text{PI}|\text{T})p(\text{T})$
To clarify this one further, let’s look again at the case of a positive test result (T=1) for somebody who has been previously infected:
$p(\text{T}=1,\text{PI}=1) = p(\text{PI}=1|\text{T}=1)p(\text{T}=1)$
The probability that an individual has a positive test result and has had the virus can also be expressed as the probability of a positive test result times the probability that someone actually has previously been infected given that the test result was positive.
We can combine both factorizations to obtain:
$p(\text{T}|\text{PI})p(\text{PI}) = p(\text{PI}|\text{T})p(\text{T})$
Solving it for p(PI|T):
$p(\text{PI}|\text{T}) = \frac{p(\text{T}|\text{PI})p(\text{PI})}{p(\text{T})}$
Et voilà – that’s one way to write down Bayes’ Theorem. It gives us a way to reason about the probability of a previous infection given a test result, when all we have to start with are the probabilities for test results given the states of previous infection. Well, we also need a way to come up with values for p(PI), the marginal probability of previous infection, and p(T), the marginal probability of test results.
Let’s assume that Germany has done a pretty good job so far – and that the roughly 200 000 confirmed cases roughly represent the number of infections we have had. Assuming a population of 80 million people (yep, I am not being precise here, I want to have some nice numbers), the chance that a randomly selected person has already had the virus would be around 1:400 (or 0.25%). Now, let’s assume we randomly test people in Germany with the aforementioned antibody test and want to know the probability that somebody who has tested positive actually has had the virus:
$p(\text{PI}=1|\text{T}=1) = \frac{p(\text{T}=1|\text{PI}=1)p(\text{PI}=1)}{p(\text{T}=1)}$
What is the marginal probability p(T=1)? Well, it is the probability that a test is positive, which can be expressed as the probability that a test is positive given that the person was previously infected times the probability that a person was previously infected, plus the probability that a test is positive given that the person was not previously infected times the probability that a person was not previously infected. Long story short:
$p(\text{T}=1) = p(\text{T}=1|\text{PI}=1)p(\text{PI}=1) + p(\text{T}=1|\text{PI}=0)p(\text{PI}=0)$
So, putting it all together, we obtain:
$p(\text{PI}=1|\text{T}=1) =\frac{p(\text{T}=1|\text{PI}=1)p(\text{PI}=1)}{p(\text{T}=1|\text{PI}=1)p(\text{PI}=1) + p(\text{T}=1|\text{PI}=0)p(\text{PI}=0)}$
Plugging in the actual numbers:
$p(\text{PI}=1|\text{T}=1) =\frac{0.98 \cdot 0.0025}{0.98 \cdot 0.0025 + 0.1 \cdot 0.9975} \approx 0.024 = 2.4 \%$
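The same calculation in a few lines of Python, as a quick sanity check of the numbers above:

```python
# Numbers from the example above
sensitivity = 0.98          # p(T=1 | PI=1)
specificity = 0.90          # p(T=0 | PI=0)
p_infected = 0.0025         # p(PI=1), roughly 200,000 / 80,000,000

p_false_positive = 1.0 - specificity                    # p(T=1 | PI=0)
p_positive = (sensitivity * p_infected
              + p_false_positive * (1.0 - p_infected))  # p(T=1), by marginalization

p_infected_given_positive = sensitivity * p_infected / p_positive
print(round(p_infected_given_positive, 3))  # ~0.024
```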
So, that’s it – despite the fact that we have a test with high sensitivity and reasonably high specificity, the probability that someone with a positive test result actually has had SARS-CoV-2 is only 2.4%. Or, putting it slightly differently – around 98% of the people who receive a positive test result will actually never have been infected. The reason for this maybe surprising result is that the vast majority of people who are tested never had the virus. Even if the probability of a false positive is only 10%, the absolute number of false positives will largely outnumber the true positives – a given positive result is thus very likely attributable to a false positive and not a true positive.
Agreed – the estimate of the prior probability that a randomly selected person has had the virus is coarse at best – it’s pretty much guesswork in this case. Yet, even if it were quite a bit higher – false positives would still outnumber true positives and Bayes’ Theorem would still rightfully tell us that we should not take the positive result at face value.
Now, whether this will play a role in testing for SARS-CoV-2 – or whether the tests will have such high sensitivities and specificities that “we don’t have to worry about statistics” – remains to be seen. There is evidence from the FDA that some of the tests are very good indeed and that the numbers given here are quite a bit too pessimistic [10]. Yet, it is known, for example, that positive results in mammograms for young women (around the age of 40) are largely dominated by false positives. The numbers vary quite a bit with exact age, source, … – but roughly only every 13th positive result is actually a true positive [11]. The reasoning is the same as in the example above – the prior is strongly skewed in the direction of “no breast cancer”, so a relatively small imperfection in the specificity leads to many more false positives than true positives. While certainly nobody should (or would) take such a positive test result lightly, the odds after one positive result would still be largely in the young woman’s favor, and with more than 90% probability she would not have breast cancer. One way of addressing this is by applying multiple tests.
Continuous Form
So far the whole discussion has been about discrete random variables – we used binary outcomes such as sick / healthy or positive / negative. The form of Bayes’ Theorem for this case has been:
$p(A|B) = \frac{p(B|A) p(A)}{p(B)}$
where the denominator p(B) can be expressed as a sum over all possible values $A^{*}$ of A:
$p(A|B) = \frac{p(B|A) p(A)}{\sum_{A^{*}} p(B|A^{*})p(A^{*})}$
If we go from the discrete to the continuous case (i.e. look at continuous random variables, such as a person’s height / weight, an employee’s income, …), the discrete probability distributions become probability densities and the sum turns into an integral:
$p(A|B) = \frac{p(B|A) p(A)}{\int p(B|A^{*})p(A^{*})dA^{*}}$
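In one dimension this integral can still easily be approximated on a grid. Here is a minimal sketch (using a made-up example: inferring a coin’s success probability theta after observing 7 heads in 10 flips, with a uniform prior):

```python
import numpy as np
from scipy.stats import binom

# Grid of candidate values for the success probability theta
theta = np.linspace(0.0, 1.0, 1001)

prior = np.ones_like(theta)            # uniform prior density on [0, 1]
likelihood = binom.pmf(7, 10, theta)   # p(data | theta): 7 heads in 10 flips

# Normalize by a simple Riemann-sum approximation of the integral
unnormalized = likelihood * prior
d_theta = theta[1] - theta[0]
posterior = unnormalized / (unnormalized.sum() * d_theta)

print(theta[np.argmax(posterior)])     # posterior mode, close to 7/10
```

In higher dimensions this brute-force approach breaks down very quickly – which is exactly the problem that the sampling methods discussed below address.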
Bayesian data analysis
So far, we have only introduced Bayes’ Theorem. There are usually not that many cases in a data analyst’s daily business where she/he/… has to calculate probabilities about tests for diseases. So, it’s perfectly fair to ask: Why should data analysts / data scientists / data … / business … care about Bayes’ Theorem?
The point is that many data-analytical problems can be seen as the attempt to infer the values of a model’s (yeah, this kind of model) parameters from data. Well, that’s exactly what Bayes’ Theorem can provide:
$p(\theta|D) = \frac{p(D|\theta) p(\theta)}{p(D)}$
Here, Theta denotes the (potentially multivariate) parameter(s) and D is simply our (very likely multivariate) data. The theorem basically means that we can get a probability distribution over our parameters given the data (p(Theta|D)). This distribution is conventionally called the posterior (after having seen the data). The distribution p(Theta) is our prior distribution (what credibility do we assign to certain parameter values before we have seen the data?). The quantity p(D|Theta) is the likelihood. Frequentists would simply use optimization to obtain the parameters Theta* that maximize the likelihood (or the log-likelihood, to make things computationally easier). The normalization constant p(D) is conventionally called the evidence or marginal likelihood (we are integrating / marginalizing the likelihood p(D|Theta) over Theta).
Bayesians also use the likelihood, but in addition they also have to specify prior distributions for their parameters. In return, they obtain an entire distribution over the parameters and not (only) point estimates of the parameter values that maximize the likelihood.
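As a minimal illustration (a sketch, re-using the made-up measurements from the MLE example above and assuming the measurement noise is known), here is a case where the posterior is available in closed form – a normal likelihood combined with a normal prior on the unknown mean:

```python
import numpy as np

# Hypothetical measurements of some quantity, with known measurement noise sigma
data = np.array([4.9, 5.3, 5.1, 4.7, 5.2, 5.0])
sigma = 0.2

# Normal prior over the unknown mean theta (e.g. knowledge from an earlier experiment)
prior_mean, prior_sd = 4.0, 1.0

# Standard conjugate update: with a normal likelihood and a normal prior,
# the posterior is again normal (precisions, i.e. 1/variance, simply add up)
n = len(data)
post_precision = 1.0 / prior_sd**2 + n / sigma**2
post_var = 1.0 / post_precision
post_mean = post_var * (prior_mean / prior_sd**2 + data.sum() / sigma**2)

print(post_mean, np.sqrt(post_var))  # the full posterior is Normal(post_mean, post_sd)
```

Instead of a single point estimate we end up with an entire distribution – and its standard deviation directly quantifies how uncertain we still are about the parameter.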
Sampling and Markov Chain Monte Carlo methods
Unfortunately, in most practical cases it is impossible to obtain an analytical solution for the posterior distribution. The denominator in the continuous form of Bayes’ Theorem consists of an integral over a potentially high-dimensional space. If you have 50 free parameters in a model (a rather realistic number), you will have to perform an integral over a space with 50 dimensions. Most of the time you will not be able to solve this analytically. Simple numerical methods (integration over “a grid”) will also fail – after all, even if you have only 10 dimensions, a grid with 1000 points per dimension will already have 1000^10 points in total, which renders grid integration computationally infeasible.
One way of tackling this is by obtaining samples from the posterior distribution. These samples are distributed according to the posterior distribution. You can basically calculate every quantity that you would want to calculate from the posterior distribution from these samples. What is the mean value of the distribution? Gather enough (independent) samples and calculate their mean. Standard deviations, … – same thing, just calculate the quantity from the obtained samples.
Usually, so-called Markov Chain Monte Carlo (MCMC) methods (a great explanation is given in [12]) are used to obtain samples from such a distribution. These methods start from initial values for the parameters. Then, new values are proposed and accepted according to certain proposal and acceptance functions. This process of proposing and accepting is iterated many times to create a large number of samples. Different methods mostly differ in their implementations of the proposal and acceptance functions.
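To make the proposal-and-acceptance idea concrete, here is a minimal sketch of the classic random-walk Metropolis algorithm (for the made-up coin-flip example from the grid sketch above; real tools such as Stan use far more sophisticated samplers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized log-posterior: Binomial likelihood (7 heads in 10 flips)
# times a uniform prior on [0, 1]
def log_post(theta):
    if theta <= 0.0 or theta >= 1.0:
        return -np.inf
    return 7 * np.log(theta) + 3 * np.log(1.0 - theta)

samples = []
theta = 0.5                                       # initial value
for _ in range(20000):
    proposal = theta + rng.normal(0.0, 0.1)       # symmetric random-walk proposal
    # Accept with probability min(1, posterior ratio); otherwise keep the old value
    if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
        theta = proposal
    samples.append(theta)

samples = np.array(samples[1000:])                # discard initial "burn-in" samples
print(samples.mean(), samples.std())              # summaries computed from samples
```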
MCMC methods may need some post-processing of the obtained samples – e.g. it can be necessary to discard the first (often hundreds or thousands of) samples, because they have been collected far away from the relevant parameter values. Also, samples are usually correlated, and thus often only every Nth sample (after the removal of the initial samples) is retained. Furthermore, the methods can fail (to give representative samples of the distribution). There are ways to diagnose this, such as running multiple “sampling chains” and comparing their outcomes. I will not go into details in this blog post – it’s an introduction with simple examples after all.
Summary
This first part gave a short review of the history of Bayesian statistics, as well as the derivation of Bayes’ Theorem. After an illustrative example for the application of Bayes’ Theorem, the transition to data-analytical applications of Bayes’ Theorem was made. We concluded with a very brief discussion of Markov Chain Monte Carlo methods.
I hope you enjoyed reading this first part – if so, please don’t forget to read part II.
Links, References:
[1] https://www.mrc-bsu.cam.ac.uk/software/bugs/
[2] http://mcmc-jags.sourceforge.net/
[3] https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo
[5] Bob Carpenter et al., 2017. Stan: A probabilistic programming language. Journal of Statistical Software 76(1). DOI 10.18637/jss.v076.i01
[6] https://pystan.readthedocs.io/en/latest/
[7] Nature: Special report: The simulations driving the world's response to COVID-19 (contains a link to the following (Bayesian) report): https://spiral.imperial.ac.uk:8443/bitstream/10044/1/77731/10/2020-03-30-COVID19-Report-13.pdf
[8] https://statmodeling.stat.columbia.edu/2016/02/15/the-recent-black-hole-ligo-experiment-used-pystan/
[9] https://bjlkeng.github.io/posts/probabilistic-interpretation-of-regularization/
[12] Richard McElreath: Statistical Rethinking: A Bayesian course with examples in R and Stan.