Bayesian Data Science: The What, Why, and How

Choosing between frequentist and Bayesian approaches is the great debate of the last century, with a recent surge in Bayesian adoption in the sciences.

Number of articles referencing Bayesian statistics on sciencedirect.com (April 2024). Graph by the author.

What’s the difference?

The philosophical difference is actually quite subtle; some even propose that the great Bayesian critic, Fisher, was himself a Bayesian in some regard. While there are countless articles that delve into the formulaic differences, what are the practical benefits? What does Bayesian analysis offer the lay data scientist that the vast plethora of highly adopted frequentist methods don't already? This article aims to provide a practical introduction to the motivation, formulation, and application of Bayesian methods. Let's dive in.

While frequentists deal with describing the exact distribution of the data, the Bayesian viewpoint is more subjective. Subjectivity and statistics?! Yes, they are actually compatible.

Let's start with something simple, like a coin flip. Suppose you flip a coin 10 times and get heads 7 times. What is the probability of heads?

P(heads) = 7/10 (0.7)?

Obviously, we are riddled with a low sample size here. From a Bayesian point of view, however, we are allowed to encode our beliefs directly, asserting that if the coin is fair, the chance of heads or tails must be equal, i.e. 1/2. While in this example the choice seems pretty obvious, the debate is more nuanced when we get to more complex, less obvious phenomena.
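For instance, one way to encode a fair-coin belief is a Beta prior centered on 0.5; paired with a Binomial likelihood, the posterior has a closed form. Below is a minimal sketch (the Beta(10, 10) prior strength is an arbitrary illustrative choice, not a recommendation):

```python
from scipy.stats import beta

# Observed data: 7 heads out of 10 flips
heads, n = 7, 10

# A fair-coin belief encoded as a Beta(10, 10) prior, centered at 0.5
# (the prior strength of 10 is an arbitrary illustrative choice)
a_prior, b_prior = 10, 10

# Beta prior + Binomial likelihood -> Beta posterior (conjugacy)
posterior = beta(a_prior + heads, b_prior + (n - heads))

print("Raw frequency of heads:", heads / n)                          # 0.7
print("Posterior mean with fair-coin prior:", round(posterior.mean(), 3))  # ~0.567
```

The posterior mean lands between the raw estimate (0.7) and the prior belief (0.5), which is exactly the pull toward fairness described above.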

Yet, this simple example is a strong starting point, highlighting both the biggest benefit and shortcoming of Bayesian analysis:

Benefit: Dealing with a lack of data. Suppose you are modeling the spread of an infection in a country where data collection is scarce. Will you use the small amount of data to derive all your insights? Or would you want to factor commonly seen patterns from similar countries into your model, i.e. informed prior beliefs? Although the choice is clear, it leads directly to the shortcoming.

Shortcoming: the prior belief is hard to formulate. For example, if the coin is not actually fair, it would be wrong to assume that P(heads) = 0.5, and there is almost no way to find the true P(heads) without a long-running experiment. In this case, assuming P(heads) = 0.5 would actually be detrimental to finding the truth. Yet every statistical model (frequentist or Bayesian) must make assumptions at some level, and the 'statistical inferences' in the human mind are actually a lot like Bayesian inference, i.e. constructing prior belief systems that factor into our decisions in every new situation. Moreover, formulating wrong prior beliefs is often not a death sentence from a modeling perspective either, as long as we can learn from enough data (more on this in later articles).

So what does all this look like mathematically? Bayes' rule lays the groundwork. Let's suppose we have a parameter θ that defines some model which could describe our data (e.g. θ could represent the mean, variance, slope w.r.t. a covariate, etc.). Bayes' rule states that

Thomas Bayes formulated Bayes' theorem in the 1700s; it was published posthumously. [Image via Wikimedia Commons, licensed under Creative Commons Attribution-Share Alike 4.0 International, unadapted]

P(θ = t | data) ∝ P(data | θ = t) * P(θ = t)

In simpler words,

  • P(θ = t | data) represents the conditional probability that θ is equal to t, given our data (a.k.a. the posterior).
  • Conversely, P(data | θ = t) represents the probability of observing our data if θ = t (a.k.a. the 'likelihood').
  • Finally, P(θ = t) is simply the probability that θ takes the value t (the infamous 'prior').
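Note that the proportionality sign hides a normalizing constant, the overall probability of the data. Written out in full, Bayes' rule reads:

P(θ = t | data) = P(data | θ = t) * P(θ = t) / P(data)

where P(data) = Σ_t′ P(data | θ = t′) * P(θ = t′), summing (or integrating) over all candidate values t′. Since P(data) does not depend on t, it only rescales the posterior so that it sums to 1, which is why the proportional form is all we need when comparing candidate values of θ.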

So what is this mysterious t? It can take many possible values, depending on what θ means. In fact, you should try a lot of values and check the likelihood of your data for each. This is a key step, and you really hope that you checked the best possible values for θ, i.e. those which cover the maximum-likelihood region of seeing your data (the global maximum, for those who care).
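For the coin example, "checking the likelihood for each value" just means evaluating the Binomial probability of 7 heads in 10 flips at a few candidate values of t. A minimal sketch (the candidate values are arbitrary):

```python
from scipy.stats import binom

heads, n = 7, 10

# Candidate values t for theta = P(heads)
for t in [0.3, 0.5, 0.7, 0.9]:
    likelihood = binom.pmf(heads, n, t)  # P(data | theta = t)
    print(f"t = {t:.1f} -> likelihood = {likelihood:.3f}")
```

Unsurprisingly, t = 0.7 (the raw frequency of heads) attains the highest likelihood; the prior then decides how far the posterior moves away from it.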

And that's the crux of everything Bayesian inference does!

  1. Form a prior belief for possible values of θ,
  2. Scale it with the likelihood at each θ value, given the observed data, and
  3. Return the computed result, i.e. the posterior, which tells you the probability of each tested θ value.

Graphically, this looks something like:

Prior (left) scaled with the likelihood (middle) forms the posterior (right) (figures adapted from Andrew Gelman's book). Here, θ encodes the east-west location coordinate of a plane. The prior belief is that the plane is more towards the east than the west. The data challenges the prior, and the posterior thus lies somewhere in the middle. [image using data generated by author]

This highlights the following big benefits of Bayesian statistics:

  • We have an idea of the entire shape of θ's distribution (e.g., how wide the peak is, how heavy the tails are, etc.), which can enable more robust inferences. Why? Simply because we can not only better understand but also quantify the uncertainty (as compared to a traditional point estimate with a standard deviation).
  • Since the process is iterative, we can continually update our beliefs (estimates) as more data flows into our model, making it much easier to build fully online models (see the sketch below).
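Both points can be seen with the conjugate coin model from earlier. In this minimal sketch, the second batch of flips is made up for illustration, and the 90% interval and batch sizes are arbitrary choices:

```python
from scipy.stats import beta

# Fair-coin prior from before
a, b = 10, 10

# Data arriving in batches of (heads, tails); the second batch is invented for illustration
batches = [(7, 3), (12, 8)]

for i, (h, t) in enumerate(batches, start=1):
    # Online update: the current posterior becomes the prior for the next batch
    a, b = a + h, b + t
    post = beta(a, b)
    lo, hi = post.ppf(0.05), post.ppf(0.95)
    print(f"after batch {i}: mean = {post.mean():.3f}, "
          f"90% credible interval = ({lo:.3f}, {hi:.3f})")
```

Instead of a single point estimate, each update hands back a full distribution whose width tells us how uncertain we still are.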

Easy enough! But not quite…

This process involves a lot of computation, where you have to calculate the likelihood for every possible value of θ. Okay, perhaps this is easy if θ lies in a small range like [0, 1]. We can just use the brute-force grid method, testing values at discrete intervals (10 values at 0.1 intervals, or 100 at 0.01 intervals, or more… you get the idea) to map the entire space with the desired resolution.
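Here is what that brute-force grid looks like for the coin example, putting the three steps above together. A minimal sketch, with an arbitrary 101-point grid and the same illustrative Beta(10, 10) prior:

```python
import numpy as np
from scipy.stats import beta, binom

heads, n = 7, 10

# A grid of candidate theta values over [0, 1] (resolution is our choice)
grid = np.linspace(0, 1, 101)

# 1. Prior belief at each grid point (the fair-coin Beta(10, 10) prior again)
prior = beta.pdf(grid, 10, 10)

# 2. Likelihood of the observed data at each grid point
likelihood = binom.pmf(heads, n, grid)

# 3. Scale the prior by the likelihood and normalize so the posterior sums to 1
unnormalized = prior * likelihood
posterior = unnormalized / unnormalized.sum()

print("Posterior mode on the grid:", grid[np.argmax(posterior)])
```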

But what if the space is large, and god forbid additional parameters are involved, like in any real-life modeling scenario?

Now we have to test not only the possible parameter values but also all their possible combinations, i.e. the solution space expands exponentially, rendering a grid search computationally infeasible. Luckily, physicists have worked on the problem of efficient sampling, and advanced algorithms exist today (e.g. Metropolis-Hastings MCMC, Variational Inference) that are able to quickly explore high-dimensional parameter spaces and home in on the high-probability regions. You don't need to code these complex algorithms yourself either; probabilistic programming languages like PyMC or STAN make the process highly streamlined and intuitive.
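As an illustration, here is roughly what the coin-flip model looks like in PyMC. This is a minimal sketch assuming PyMC v4+; the Beta(10, 10) prior and the sampler settings are arbitrary choices for this example:

```python
import pymc as pm  # PyMC v4+ (older releases were imported as pymc3)

heads, n = 7, 10

with pm.Model() as coin_model:
    # Prior: fair-coin belief encoded as Beta(10, 10)
    theta = pm.Beta("theta", alpha=10, beta=10)

    # Likelihood: 7 heads observed out of 10 flips
    pm.Binomial("obs", n=n, p=theta, observed=heads)

    # Let the MCMC sampler explore the posterior
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=42)

print("Posterior mean of theta:", float(idata.posterior["theta"].mean()))
```

The sampler returns draws from the posterior rather than a closed-form answer, but for this simple model the result should closely match the conjugate and grid computations above.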

STAN

STAN is my favorite, as it allows interfacing with the more common data science languages like Python, R, Julia, MATLAB, etc., aiding adoption. STAN relies on state-of-the-art Hamiltonian Monte Carlo sampling techniques that virtually guarantee reasonably timed convergence for well-specified models. In my next article, I'll cover how to get started with STAN for simple as well as not-so-simple regression models, with a full Python code walkthrough. I will also cover the full Bayesian modeling workflow, which involves model specification, fitting, visualization, comparison, and interpretation.

Follow & stay tuned!
