Chapter 5, Part 1: The Bayesian Paradigm
Advanced Topics in Statistical Machine Learning
Tom Rainforth
Hilary 2022
rainforth@stats.ox.ac.uk
2. Bayesian Probability is All About Belief
Frequentist Probability
The frequentist interpretation of probability is that it is the
average proportion of the time an event will occur if a trial is
repeated infinitely many times.
Bayesian Probability
The Bayesian interpretation of probability is that it is the
subjective belief that an event will occur in the presence of
incomplete information.
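The distinction can be made concrete with a tiny simulation (an illustrative sketch, not from the slides): under the frequentist reading, the probability of an event is the long-run fraction of repeated trials in which it occurs, whereas the Bayesian reading places a belief distribution over the unknown quantity itself.

```python
# Minimal sketch of the frequentist interpretation: probability as long-run frequency.
# The event probability p_true and trial count are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.3                              # true (unknown) probability of the event
events = rng.random(100_000) < p_true     # repeat the trial many times
print(events.mean())                      # empirical frequency -> approximately 0.3

# A Bayesian would instead treat the unknown probability as a quantity we hold
# beliefs about, encoded as a distribution that gets updated as data arrive.
```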
7. Using Bayes’ Rule
• Encode initial belief about parameters θ using a prior p(θ)
• Characterize how likely different values of θ are to have given rise to the observed data D using a likelihood function p(D|θ)
• Combine these to give the posterior, p(θ|D), using Bayes' rule:
  p(θ|D) = p(D|θ)p(θ) / p(D)    (1)
• This represents our updated belief about θ once the information from the data has been incorporated
• Finding the posterior is known as Bayesian inference
• p(D) = ∫ p(D|θ)p(θ) dθ is a normalization constant known as the marginal likelihood or model evidence
• This does not depend on θ, so we have
  p(θ|D) ∝ p(D|θ)p(θ)    (2)
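To make Eqs. (1) and (2) concrete, here is a minimal sketch on a hypothetical coin-flip model (Beta prior, binomial likelihood; an illustrative choice, not from the slides): the posterior is the prior times the likelihood, and the marginal likelihood p(D) is the integral that normalizes it, approximated here on a grid.

```python
# Minimal sketch of Bayes' rule on a grid for a hypothetical coin-flip model.
import numpy as np
from scipy.stats import beta, binom

theta = np.linspace(0.0, 1.0, 1001)            # grid of parameter values θ
prior = beta.pdf(theta, 2, 2)                  # p(θ): prior belief about the coin bias
heads, flips = 7, 10                           # observed data D
likelihood = binom.pmf(heads, flips, theta)    # p(D|θ) evaluated on the grid

unnorm = likelihood * prior                        # p(D|θ)p(θ), Eq. (2)
evidence = unnorm.sum() * (theta[1] - theta[0])    # p(D) = ∫ p(D|θ)p(θ) dθ
posterior = unnorm / evidence                      # p(θ|D), Eq. (1)

print(evidence)                     # marginal likelihood / model evidence
print(theta[np.argmax(posterior)])  # posterior mode ≈ 0.67, between prior mean 0.5 and MLE 0.7
```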
8. Example: Positive COVID Test
We just got a positive COVID test: what is the probability we actually have COVID?
Short answer: it rather depends on why we got tested and the current prevalence of COVID.
Note that the numbers in this example are not remotely accurate and are only used for demonstration.
9. Example: Positive COVID Test from Randomized Testing
• Let θ = 1 denote the scenario where we have COVID, and say 1/100 people in our area currently have COVID
• If we got tested at random we might thus choose to use the prior p(θ = 1) = 1/100
• Let's assume the test is 95% accurate regardless of whether we have COVID, so p(D|θ = 1) = 0.95, p(D|θ = 0) = 0.05.
Applying Bayes' rule:
p(θ = 1|D) = p(D|θ = 1)p(θ = 1) / [p(D|θ = 1)p(θ = 1) + p(D|θ = 0)p(θ = 0)]
           = (0.95 × 0.01) / (0.95 × 0.01 + 0.05 × 0.99)
           ≈ 0.16
So it seems our chances of having COVID are actually quite low!
10. Example: Positive COVID Test with Symptoms
• Imagine we now instead get a test specifically because we were showing symptoms
• Let the proportion of such tests that are positive be 0.3, so we choose the prior p(θ = 1) = 0.3
Bayes' rule now yields
p(θ = 1|D) = p(D|θ = 1)p(θ = 1) / [p(D|θ = 1)p(θ = 1) + p(D|θ = 0)p(θ = 0)]
           = (0.95 × 0.3) / (0.95 × 0.3 + 0.05 × 0.7)
           ≈ 0.89
So now it is extremely likely we have COVID!
Take home: the prior matters
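The two calculations above are easy to reproduce in code; the following is a small illustrative sketch (the function name is our own, and the numbers are the made-up ones from the example).

```python
# Posterior probability of having COVID given a positive test, via Bayes' rule for a
# binary θ; the accuracy numbers match the (made-up) ones used in the example.
def posterior_covid(prior_covid, p_pos_if_covid=0.95, p_pos_if_healthy=0.05):
    numerator = p_pos_if_covid * prior_covid
    evidence = numerator + p_pos_if_healthy * (1.0 - prior_covid)
    return numerator / evidence

print(posterior_covid(0.01))  # randomized testing prior   -> ~0.16
print(posterior_covid(0.3))   # symptomatic testing prior  -> ~0.89
```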
11. Multiple Observations: Using the Posterior as the Prior
• One of the key characteristics of Bayes' rule is that it is self-similar under multiple observations
• We can use the posterior after our first observation as the prior when considering the next:
  p(θ|D1, D2) = p(D2|θ, D1) p(θ|D1) / p(D2|D1) = p(D1, D2|θ) p(θ) / p(D1, D2)
• We can think of this as a continual updating of our beliefs as we receive more information
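A quick numerical check of this self-similarity, using an assumed Beta-Bernoulli coin model (an illustrative choice, not from the slides): updating on D1 and then reusing the resulting posterior as the prior for D2 gives the same distribution as conditioning on both datasets at once.

```python
# Sequential vs. batch Bayesian updating on a grid for a Beta-Bernoulli coin model.
import numpy as np
from scipy.stats import beta, binom

theta = np.linspace(0.0, 1.0, 1001)
d_theta = theta[1] - theta[0]

def grid_posterior(prior_pdf, heads, flips):
    """Normalized posterior p(θ | heads out of flips) on the grid."""
    unnorm = binom.pmf(heads, flips, theta) * prior_pdf
    return unnorm / (unnorm.sum() * d_theta)

prior = beta.pdf(theta, 1, 1)                       # uniform prior p(θ)
post_d1 = grid_posterior(prior, 3, 5)               # p(θ|D1): 3 heads in 5 flips
post_seq = grid_posterior(post_d1, 6, 10)           # reuse p(θ|D1) as the prior for D2
post_batch = grid_posterior(prior, 3 + 6, 5 + 10)   # p(θ|D1, D2) in one go

print(np.allclose(post_seq, post_batch))            # True: both routes agree
```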
12. Making Predictions
• Prediction in Bayesian models is done using the posterior predictive distribution
• This is defined by taking the expectation of a predictive model for new data, p(D∗|θ), with respect to the posterior:
  p(D∗|D) = Ep(θ|D)[p(D∗|θ)].    (3)
• Note here that we are making the standard assumption that the data is conditionally independent given θ (we can in theory use p(D∗|θ, D) instead)
• Prediction is often done conditioned on an input point, such that we actually calculate p(y|x, D) = Ep(θ|D)[p(y|x, θ)]
• Note that this can be very expensive: it typically requires approximations
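A simple Monte Carlo estimate is one common approximation: sample from the posterior and average the predictive model over the samples. Below is a minimal sketch using an assumed conjugate Beta-Bernoulli model (not from the slides), chosen so that the exact answer is known.

```python
# Monte Carlo estimate of the posterior predictive in Eq. (3) for a Beta-Bernoulli model.
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)

# With a Beta(1, 1) prior and 7 heads in 10 flips, conjugacy gives the posterior Beta(8, 4).
theta_samples = beta.rvs(8, 4, size=100_000, random_state=rng)

# p(D* = heads | D) ≈ (1/S) Σ_s p(D* = heads | θ_s), which here is just the mean of θ_s.
print(theta_samples.mean())   # ≈ 8/12 ≈ 0.667, the exact posterior predictive probability
```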
13. Why Should we Take a Bayesian Approach?
Bayesian Reasoning is the Language of Epistemic Uncertainty
Bayesian reasoning is the basis for how to make decisions with incomplete information.
Bayesian methods allow us to construct models that return principled uncertainty estimates.
[Figure source: https://localcovid.info/]
14. Why Should we Take a Bayesian Approach?
Bayesian Modeling Lets us Utilize Domain Expertise
Bayesian modeling allows us to combine information from data with that from prior expertise.
Models make clear assumptions and are explainable.
Bayesian models are often interpretable; they can be easily queried, criticized, and built on by humans.
15. Shortfalls [Non Exhaustive]
• Bayesian inference is typically very difficult and expensive: getting around the proportionality constant in Bayes' rule is surprisingly challenging
• All models are approximations of the world
• Constructing accurate models can be very difficult
• We will always impart incorrect assumptions on our model, particularly in our likelihood function
• For large datasets, the bias from these can usually be avoided by using a powerful discriminative method
• Bayesian reasoning only incorporates uncertainty that is within our model: it does not account for unknown unknowns
• This can lead to overconfidence
• Our probabilities/uncertainties are always inherently subjective
• Can struggle to deal with outliers in the data because likelihood terms are multiplicative
16. Recap
• Bayesian machine learning is a generative approach that allows us to incorporate uncertainty and information from prior expertise
• Bayes' rule: p(θ|D) ∝ p(D|θ)p(θ)
• Posterior predictive: p(D∗|D) = Ep(θ|D)[p(D∗|θ)]
17. Further Reading
• Additional examples in the notes
• Chapter 1 of C. Robert. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. 2007. https://www.researchgate.net/publication/41222434_The_Bayesian_Choice_From_Decision_Theoretic_Foundations_to_Computational_Implementation
• Michael I. Jordan. Are you a Bayesian or a frequentist? Video lecture, 2009. http://videolectures.net/mlss09uk_jordan_bfway/