Variational Bayes:
A Gentle Introduction
Flavio Mejia Morelli
December 3, 2019
2/81
About me
• Flavio, from Bogotá (Colombia)
• Education: MSc. Statistics at Humboldt University Berlin
• Work: statistical consulting unit of the Free University of
Berlin fu:stat
3/81
About this talk
• Basic knowledge of probability theory
• Basic knowledge of Bayesian statistics
• Focus on intuition, from a learner’s perspective
• However, Bayesian computation is still very technical
• Feel free to ask questions during the talk!
• For the material, check @ flaviomorelli on Twitter
4/81
Contents
1 What is variational Bayes?
2 Divergence between distributions
3 Estimating variational approximations
4 Example with PyMC3
5 Evaluating variational approximations
6 Alternative distance measures
7 Summary
5/81
Contents
1 What is variational Bayes?
2 Divergence between distributions
3 Estimating variational approximations
4 Example with PyMC3
5 Evaluating variational approximations
6 Alternative distance measures
7 Summary
6/81
Motivation: Why Variational Bayes?
• One of the main alternatives to MCMC
• Estimate different kinds of models much faster than MCMC
(usually at the expense of precision)
• However, variational inference is less well understood than
MCMC
7/81
Terminology: Bayes’ theorem
p(θ|y) = p(θ, y) / p(y) = p(y|θ) · p(θ) / ∫Θ p(y|θ) · p(θ) dθ
8/81
Terminology: Bayes’ theorem
p(θ|y) = p(θ, y) / p(y) = p(y|θ) · p(θ) / ∫Θ p(y|θ) · p(θ) dθ
• θ ∈ Θ: a parameter or vector of parameters (e.g. the
probability in a binomial distribution) in the parameter space
Θ
• y: data
• p(θ|y): posterior
• p(y|θ): likelihood function that captures how we are modeling
our data stochastically (e.g. y is a binomial variable)
• p(θ): prior knowledge about the parameters (e.g. a
probability can only be between 0 and 1)
9/81
Terminology: Bayes’ theorem
p(θ|y) = p(θ, y) / p(y) = p(y|θ) · p(θ) / ∫Θ p(y|θ) · p(θ) dθ
• p(y) = ∫Θ p(y|θ) · p(θ) dθ: the marginal likelihood or evidence
• p(y) is a constant that normalizes the expression so that the
posterior integrates to one.
• Problem: p(y) is intractable even for very simple models,
and this makes it very hard to calculate the posterior
• Bayesian computation methods get around this intractability
by different means
10/81
MCMC and variational inference
• Markov Chain Monte Carlo (MCMC) is one of the most
common methods of estimating parameters in a Bayesian
model
• A "little fly" takes samples from the parameter space to find
out how likely a given combination of parameters is
• Different approaches: Gibbs Sampler, Metropolis-Hastings,
Hamiltonian Monte Carlo, NUTS
• Pro: more Monte Carlo samples lead to a more accurate
estimate
• Con: slow and curse of dimensionality
11/81
MCMC and variational inference
• Variational inference turns the estimation of the posterior into
an optimization problem (i.e. maximize or minimize)
• Main idea: find another probability function that is easier to
work with than the posterior
• Minimize the difference between the new probability function
and the posterior
12/81
MCMC and variational inference
How do we measure the difference between
probability functions?
13/81
Contents
1 What is variational Bayes?
2 Divergence between distributions
3 Estimating variational approximations
4 Example with PyMC3
5 Evaluating variational approximations
6 Alternative distance measures
7 Summary
14/81
Difference between density functions
15/81
Difference between density functions
• Let p(x) and q(x) be two probability densities
• A naive approach for a given x: the ratio q(x) / p(x)
16/81
Difference between density functions
• However, if we flip the terms we see that the absolute value of
this measure also changes: e.g. 0.3 / 0.1 = 3, but 0.1 / 0.3 = 1/3
• If we take the logarithm, the absolute value stays constant
after flipping the probabilities, only the sign changes:
log(q(x) / p(x)) = −log(p(x) / q(x))
17/81
Difference between density functions
• Assume that we are interested mainly in q(x)
• Use q(x) as a weight for the difference measure: q(x) · log(q(x) / p(x))
• When q(x) is low, the difference measure log(q(x) / p(x)) does not
matter as much as when q(x) is high!
18/81
Difference between density functions
• As a final step, we integrate over all the possible values of x
(or sum if the density is not continuous)
DKL = ∫X q(x) log(q(x) / p(x)) dx
• DKL is called the Kullback-Leibler divergence of q from p, written DKL(q‖p)
19/81
Kullback-Leibler Divergence
DKL(q‖p) = ∫X q(x) log(q(x) / p(x)) dx
• Common measure for the divergence between two probability
densities
• The Kullback-Leibler divergence is not symmetric, and thus
it cannot be called a "distance": DKL(q‖p) ≠ DKL(p‖q)
20/81
Kullback-Leibler Divergence
∫X q(x) log(q(x) / p(x)) dx = − ∫X q(x) log(p(x) / q(x)) dx
• Note that if we flip the densities inside the logarithm, the
divergence does not change
• DKL ≥ 0 for any pair of densities. If DKL = 0, both densities
are the same at each point
21/81
Examples of KL-Divergence: DKL(Q‖P)
22/81
Examples of KL-Divergence: DKL(P‖Q)
23/81
Examples of KL-Divergence: bimodal distribution
24/81
Examples of KL-Divergence: bimodal distribution, lower variance
25/81
Variational approximation of the posterior
• Find a q∗(θ) which minimizes the KL divergence between q(θ)
and the posterior p(θ|y):
DKL(q‖p) = ∫Θ q(θ) log( q(θ) / p(θ|y) ) dθ
• However, in order to calculate the KL divergence we would
have to know the posterior p(θ|y), which is intractable.
• We are back to square one...
26/81
Side note: What does "variational" mean?
• The term "variational" comes from variational calculus
• One of the main topics of calculus is optimization
• Optimization is usually done with respect to a variable.
• A common problem in economics is finding an optimal
quantity Q∗ that maximizes profit given a demand and a cost
function
• In contrast, variational calculus optimizes with respect to a
function. In our case, we are trying to find a q∗(θ) which
minimizes the Kullback-Leibler divergence
27/81
Side note: KL-divergence as expected value
• Some papers and textbooks write the KL divergence as an
expected value with respect to q
• Because we are weighting by q(θ), we can express it as an
expected value:
DKL(q‖p) = ∫Θ q(θ) log( q(θ) / p(θ|y) ) dθ = Eq[ log( q(θ) / p(θ|y) ) ]
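Because the divergence is an expectation under q, it can be estimated by averaging over samples drawn from q. A minimal sketch, with two arbitrary Gaussians standing in for q and the posterior:

```python
import numpy as np
from scipy import stats

# D_KL(q||p) = E_q[log q(theta) - log p(theta)], estimated by Monte Carlo.
# q and p are stand-ins; in variational inference p would be the intractable posterior.
q = stats.norm(0.0, 1.0)
p = stats.norm(1.0, 2.0)

theta = q.rvs(size=100_000, random_state=0)       # samples from q
kl_mc = np.mean(q.logpdf(theta) - p.logpdf(theta))
print(kl_mc)   # close to the closed-form KL between the two Gaussians
```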
28/81
So, what now?
Find an alternative way to optimize the
divergence!
29/81
ELBO: evidence lower bound
• It can be shown that:
log p(y) = ∫Θ q(θ) log( p(y, θ) / q(θ) ) dθ + ∫Θ q(θ) log( q(θ) / p(θ|y) ) dθ
where the first term is L and the second term is DKL(q‖p)
• DKL(q‖p) is the Kullback-Leibler divergence
• L is the evidence lower bound
• The Kullback-Leibler divergence is intractable, because it
contains the posterior
• On the other hand, it is possible to compute all the terms in
L, as p(y, θ) = p(θ)p(y|θ) is known
30/81
ELBO: derivation
log p(y) = log p(y) · 1 = log p(y) ∫Θ q(θ) dθ
= ∫Θ q(θ) log p(y) dθ = ∫Θ q(θ) log( p(y, θ) / p(θ|y) ) dθ
= ∫Θ q(θ) log( (p(y, θ) / p(θ|y)) · (q(θ) / q(θ)) ) dθ
= ∫Θ q(θ) log( p(y, θ) / q(θ) ) dθ + ∫Θ q(θ) log( q(θ) / p(θ|y) ) dθ
Which can be written as (Ormerod & Wand, 2010):
log p(y) = L + DKL(q‖p)
31/81
ELBO: importance and optimization
• By rearranging we get:
L = log p(y) − DKL(q‖p)
• As DKL ≥ 0 ⇒ log p(y) ≥ L
• Therefore, L is a lower bound on the logarithm of the
evidence p(y)
• Hence the name evidence lower bound or ELBO for L
32/81
ELBO: importance and optimization
L = log p(y) − DKL(q‖p)
• Key idea: Maximizing the ELBO is equivalent to minimizing
the Kullback-Leibler divergence
• The idea of maximizing the ELBO is the basis of most
variational inference approaches
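A minimal numerical check of the identity log p(y) = L + DKL(q‖p), using a conjugate Beta-Bernoulli model where the evidence and the posterior are available in closed form (the model, data and candidate q below are illustrative assumptions, not taken from the slides):

```python
import numpy as np
from scipy import stats, special

# Beta-Bernoulli model: theta ~ Beta(a0, b0), y_i ~ Bernoulli(theta)
a0, b0 = 2.0, 2.0
y = np.array([1, 0, 1, 1, 0, 1, 1, 1])
k, n = y.sum(), len(y)

# Exact log evidence and exact posterior (possible here because the model is conjugate)
log_evidence = special.betaln(a0 + k, b0 + n - k) - special.betaln(a0, b0)
posterior = stats.beta(a0 + k, b0 + n - k)

# An arbitrary variational candidate q(theta)
q = stats.beta(3.0, 2.0)
theta = q.rvs(size=200_000, random_state=0)

log_joint = (stats.beta(a0, b0).logpdf(theta)                    # log p(theta)
             + k * np.log(theta) + (n - k) * np.log1p(-theta))   # log p(y|theta)
elbo = np.mean(log_joint - q.logpdf(theta))               # L = E_q[log p(y,theta) - log q(theta)]
kl = np.mean(q.logpdf(theta) - posterior.logpdf(theta))   # D_KL(q || p(theta|y))

print(elbo + kl, log_evidence)   # the two numbers agree: log p(y) = L + KL
```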
33/81
Contents
1 What is variational Bayes?
2 Divergence between distributions
3 Estimating variational approximations
4 Example with PyMC3
5 Evaluating variational approximations
6 Alternative distance measures
7 Summary
34/81
Making the ELBO tractable
• In theory, we could take any distribution q we like to
approximate the posterior
• However, there are usually restrictions on q to make the
problem more tractable
35/81
Making the ELBO tractable
• The most common restrictions are (Ormerod & Wand, 2010):
• Mean-field assumption: q(θ) factorizes into
q1(θ1) · ... · qM(θM) for some partition {θ1, ..., θM}
• q comes from a parametric family of density functions
36/81
Mean-field assumption
q(θ) = q1(θ1) · ... · qM(θM)
• E.g. normal distribution: p(µ, σ) = p(µ)p(σ)
• Note that the quality of the approximation depends on the
dependence between θi
• Example: given p(θ1, θ2|y), if θ1 and θ2 are highly dependent,
then the restriction q(θ1, θ2) = q1(θ1)q2(θ2) will cause a
degradation in the inference
37/81
Mean-field assumption
• The graphic shows a mean-field approximation of a
two-dimensional Gaussian posterior (Blei, Kucukelbir, & McAuliffe,
2017)
• Both distributions share the same mean, but the covariance
structure is different
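Since the figure is not reproduced here, a minimal sketch of the effect it illustrates: the optimal mean-field fit to a correlated two-dimensional Gaussian keeps the means but underestimates the variances (the closed-form solution below uses the standard result that the factor precisions equal the diagonal of the precision matrix; the numbers are illustrative):

```python
import numpy as np

# True 2-D Gaussian posterior with strong correlation (illustrative values)
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])

# Optimal mean-field approximation q(theta1)q(theta2) under KL(q||p):
# each factor is Gaussian with the same mean and variance 1 / Lambda_ii,
# where Lambda is the precision matrix of p.
Lambda = np.linalg.inv(Sigma)
mf_var = 1.0 / np.diag(Lambda)

print("true marginal variances:", np.diag(Sigma))   # [1.0, 1.0]
print("mean-field variances:   ", mf_var)           # much smaller: variance is underestimated
```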
38/81
Coordinate ascent variational inference: algorithm
1 Initialize q∗2(θ2), ..., q∗M(θM)
2 Cycle:
q∗1(θ1) = exp{E−θ1 log p(y, θ)} / ∫ exp{E−θ1 log p(y, θ)} dθ1
...
q∗M(θM) = exp{E−θM log p(y, θ)} / ∫ exp{E−θM log p(y, θ)} dθM
until the increase in the ELBO is negligible (Ormerod & Wand,
2010)
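To make the updates concrete, a minimal CAVI sketch for a univariate Gaussian with unknown mean and precision (the standard textbook example also used in the Sia (2019) reference on the next slides; data, hyperparameter values and iteration count are assumptions):

```python
# Model: x_i ~ N(mu, 1/tau), mu ~ N(mu0, 1/(lambda0*tau)), tau ~ Gamma(a0, b0)
# Mean-field factors: q(mu) = N(mu_n, 1/lambda_n), q(tau) = Gamma(a_n, b_n)
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=0.75, size=500)   # simulated data (assumed)
N, xbar = len(x), x.mean()

mu0, lambda0, a0, b0 = 0.0, 1.0, 1.0, 1.0       # prior hyperparameters (assumed)

mu_n, lambda_n = xbar, 1.0
a_n = a0 + (N + 1) / 2                          # this shape parameter is fixed by the model
b_n = b0

for _ in range(50):                             # in practice: stop when the ELBO stops increasing
    e_tau = a_n / b_n                           # E_q[tau]
    # Update q(mu) given the current q(tau)
    mu_n = (lambda0 * mu0 + N * xbar) / (lambda0 + N)
    lambda_n = (lambda0 + N) * e_tau
    # Update q(tau) given the current q(mu)
    e_data = np.sum((x - mu_n) ** 2) + N / lambda_n          # E_q[ sum_i (x_i - mu)^2 ]
    e_prior = lambda0 * ((mu_n - mu0) ** 2 + 1 / lambda_n)   # E_q[ lambda0 (mu - mu0)^2 ]
    b_n = b0 + 0.5 * (e_data + e_prior)

print(f"E_q[mu] = {mu_n:.3f}, E_q[tau] = {a_n / b_n:.3f} (true tau = {1 / 0.75**2:.3f})")
```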
39/81
Coordinate ascent variational inference: key takeaways 1
• The algorithm is very inefficient, since it uses the complete
data set each time it updates q∗i(θi), and then repeats
until the ELBO converges
• This means CAVI cannot handle massive data
40/81
Coordinate ascent variational inference: key takeaways 2
• The ELBO and the optimal solution for each q∗i have to be
calculated by hand for each model
• There are good examples of the mathematical derivation
(with Python implementation) for a univariate Gaussian model
(Sia, 2019)
• Another example for the mathematical derivation of the
ELBO can be found in the original paper on Latent Dirichlet
Allocation, a topic model algorithm (Blei, Ng, & Jordan, 2003)
41/81
Coordinate ascent variational inference: key takeaways 3
• This approach only works for conditionally conjugate
models like multilevel regression (e.g. linear, probit, Poisson),
matrix factorization (e.g. factor analysis, PCA),
mixed-membership models (e.g. LDA), etc. (Blei, 2017)
• There is a relation between CAVI and Gibbs sampling
(Ormerod & Wand, 2010)
42/81
Stochastic variational inference (SVI)
• Based on stochastic approximation (Robbins & Monro, 1951)
• On each iteration: take a subsample of the data, infer the
local structure, update the global structure, repeat
• The gradient is noisy, but it still performs coordinate
ascent on the ELBO
• More efficient than CAVI, and more accurate results (Blei,
2017)
43/81
Stochastic variational inference (SVI): disadvantages
• Still has to be derived analytically
• Still limited to the same types of model that worked with CAVI
44/81
Black box variational inference: motivation
• CAVI and SVI problems:
• Problem # 1: Time-consuming For each model, you have
to derive the formulas analytically, and then come up with an
algorithm to calculate the model
• Problem # 2: Lack of generalization What happens to
models that are not conditionally conjugate? (Bayesian logistic
regression cannot be easily estimated with CAVI or SVI!)
45/81
Black box variational inference: goal
• Goal: Use variational inference with any model
• No mathematical work besides specifying the model
• Most Bayesian packages and languages (Stan, PyMC3,...) use
this kind of variational inference to estimate the model (e.g.
ADVI)
46/81
Black box variational inference: how does it work?
• Noisy gradient of the ELBO, where λ are the parameters of the
variational distribution (Ranganath, Gerrish, & Blei, 2014):
∇λL = Eq[(∇λ log q(θ|λ))(log p(y, θ) − log q(θ|λ))]
• q(θ|λ) is from the variational family and can be stored in a
library
• p(y, θ) = p(θ)p(y|θ) is equivalent to writing down the model
• Black box criteria:
• Sample from q(θ|λ) (Monte Carlo)
• Evaluate ∇λ log q(θ|λ)
• Evaluate log p(y, θ)
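A minimal score-function sketch of this gradient estimator for a toy conjugate model (the model, step sizes and the simple mean baseline are illustrative assumptions, not the exact algorithm of Ranganath et al., 2014; the estimator is stochastic, so results vary):

```python
# Toy model: y_i ~ N(theta, 1), theta ~ N(0, 1); variational family q(theta|lambda) = N(m, s^2)
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.0, size=20)
n = len(y)

m, log_s = 0.0, 0.0                  # variational parameters lambda = (m, log s)

for t in range(3000):
    s = np.exp(log_s)
    theta = rng.normal(m, s, size=500)                        # samples from q(theta|lambda)
    log_q = stats.norm.logpdf(theta, m, s)
    log_joint = (stats.norm.logpdf(theta, 0.0, 1.0)           # log prior
                 + stats.norm.logpdf(y[:, None], theta, 1.0).sum(axis=0))  # log likelihood
    f = log_joint - log_q
    f = f - f.mean()                                          # simple baseline to reduce variance
    score_m = (theta - m) / s**2                              # d/dm log q
    score_log_s = (theta - m) ** 2 / s**2 - 1.0               # d/dlog_s log q
    lr = 0.1 / (1.0 + t) ** 0.7                               # Robbins-Monro step size (assumed)
    m += lr * np.mean(score_m * f)
    log_s += lr * np.mean(score_log_s * f)

post_var = 1.0 / (n + 1)                                      # exact posterior for comparison
print(f"q: N({m:.3f}, {np.exp(log_s)**2:.3f})  exact posterior: "
      f"N({y.sum() * post_var:.3f}, {post_var:.3f})")
```

In practice, control variates and adaptive step sizes (as in Ranganath et al., 2014) are needed to keep the variance of this estimator under control; convergence should be monitored via the ELBO.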
47/81
Black box variational inference: key takeaways
• q(θ|λ) (variational approximation) and p(y, θ) (our model)
are known
• Only these two components are needed to calculate the
gradient and maximize the ELBO
• No time-consuming analytical derivation needed
• Makes it possible to estimate a wider range of models e.g.
GLM, nonlinear time series, volatility models (e.g. ARCH,
GARCH), Bayesian neural networks (Blei, 2017)
48/81
Contents
1 What is variational Bayes?
2 Divergence between distributions
3 Estimating variational approximations
4 Example with PyMC3
5 Evaluating variational approximations
6 Alternative distance measures
7 Summary
49/81
Model: univariate linear regression
y = α + βx
y ∼ N(α + βx, σ)
α ∼ G⁻¹(1, 1)
β ∼ N(0, 2)
σ ∼ G⁻¹(1, 1)
• Note that the normal distribution is parametrized with the
standard deviation
• This example is based on (Ioannides, 2018)
50/81
Model: true parameters
• The data is generated by simulation
• True parameters:
α = 1
β = 1
σ = 0.75
51/81
Model: simulated data
52/81
Model: code
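The slide shows the model code only as a screenshot; a minimal PyMC3 sketch of the regression model specified above might look as follows (data size, seed and variable names are assumptions):

```python
import numpy as np
import pymc3 as pm

# Simulate data with the true parameters from the previous slides
np.random.seed(42)
x = np.random.uniform(0, 2, size=200)
y = 1.0 + 1.0 * x + np.random.normal(0, 0.75, size=200)

with pm.Model() as model:
    alpha = pm.InverseGamma("alpha", alpha=1, beta=1)   # α ~ G⁻¹(1, 1)
    beta = pm.Normal("beta", mu=0, sigma=2)             # β ~ N(0, 2)
    sigma = pm.InverseGamma("sigma", alpha=1, beta=1)   # σ ~ G⁻¹(1, 1)
    mu = alpha + beta * x
    y_obs = pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)
```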
53/81
Model: estimation with NUTS
54/81
Model: estimation with ADVI
• It is also possible to use minibatches (PyMC3 documentation)
• Use "fullrank_advi" to keep the correlation structure
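A minimal sketch of how the NUTS, ADVI and full-rank ADVI runs compared on the next slides might be produced, continuing the hypothetical model block above (iteration counts and draw numbers are assumptions):

```python
with model:
    # NUTS (MCMC) as the reference
    trace_nuts = pm.sample(2000, tune=1000, random_seed=42)

    # Mean-field ADVI
    approx_mf = pm.fit(n=30_000, method="advi")
    trace_advi = approx_mf.sample(2000)

    # Full-rank ADVI keeps the correlation structure between parameters
    approx_fr = pm.fit(n=30_000, method="fullrank_advi")
    trace_fullrank = approx_fr.sample(2000)

pm.summary(trace_nuts)   # compare with pm.summary(trace_advi) and pm.summary(trace_fullrank)
```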
55/81
Model: estimation with NUTS and ADVI
[Figure: posterior estimates from NUTS and ADVI]
• Full-rank estimates are similar to the ADVI estimates, but with a
higher std. dev.
56/81
Model: estimation with NUTS
57/81
Model: estimation with ADVI
58/81
Model: maximizing the ELBO
59/81
Correlation in MCMC
• In a linear regression, the fitted line goes through the point (x̄, ȳ)
• The intercept and the slope are negatively correlated
• A higher slope means that the intercept has to adjust
downwards
60/81
Correlation in ADVI (mean-field)
• With ADVI, there is no correlation due to the mean-field
assumption
61/81
Correlation in Full Rank ADVI
• Full rank ADVI takes into account the covariance structure
of the model
• However, it can take longer to estimate
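Continuing the hypothetical traces from the estimation sketch above, the parameter correlation under each method can be checked roughly as follows (the trace variable names are assumptions carried over from that sketch):

```python
import numpy as np

for name, trace in [("NUTS", trace_nuts),
                    ("mean-field ADVI", trace_advi),
                    ("full-rank ADVI", trace_fullrank)]:
    corr = np.corrcoef(trace["alpha"], trace["beta"])[0, 1]
    print(f"{name}: corr(alpha, beta) = {corr:.2f}")
# Expected pattern: clearly negative for NUTS and full-rank ADVI,
# close to zero for mean-field ADVI
```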
62/81
Contents
1 What is variational Bayes?
2 Divergence between distributions
3 Estimating variational approximations
4 Example with PyMC3
5 Evaluating variational approximations
6 Alternative distance measures
7 Summary
63/81
Evaluating variational inference
• It can be challenging to discover problems with the posterior
approximation (Yao, Vehtari, Simpson, & Gelman, 2018)
• Variational inference approaches come with few theoretical
guarantees
• It is difficult to assess the quality of the approximation
64/81
Evaluating variational inference
• Usual problems (Blei et al., 2017):
• Slow ELBO convergence
• The inability of the approximation family to capture the true
posterior
• The mean-field assumption makes it difficult to calculate
posterior correlation
• The asymmetry of the true distribution
• KL divergence under-penalizes approximations with too light
tails
65/81
Diagnostics in common Bayesian frameworks
• Both Stan and PyMC3 use ADVI (Kucukelbir, Tran, Ranganath,
Gelman, & Blei, 2017), which is related to Black Box variational
inference
• Both frameworks have convergence diagnostics (in Stan
there is still an open pull request to implement all the
diagnostics from the Yao et al. paper)
66/81
Contents
1 What is variational Bayes?
2 Divergence between distributions
3 Estimating variational approximations
4 Example with PyMC3
5 Evaluating variational approximations
6 Alternative distance measures
7 Summary
67/81
Issues with KL-divergence
• The KL divergence punishes placing mass in q where p has
little mass, but it penalizes the reverse less
• The mean-field assumption can lead to underestimating the
posterior variance and to missing important modes (Blei et al.,
2017).
• However, there is fullrank ADVI which helps capture
parameter correlations (e.g. in PyMC3, Stan)
• There are alternative divergence measures (e.g.
α-divergence), but their use is still an open question
68/81
Alternative divergence measures: expectation
propagation
• Why don’t we minimize DKL(p‖q), which would be more
intuitive?
DKL(p‖q) = ∫Θ p(θ|y) log( p(θ|y) / q(θ) ) dθ
• This alternative is also feasible; minimizing it is the basis of
expectation propagation
• However, it is less efficient than minimizing DKL(q‖p) (Blei et
al., 2017)
69/81
Alternative divergence measures: α-divergence and
tail-adaptive f-divergence
• α-divergence: typically larger values of α enforce stronger
mass-covering, i.e. approximation q covers more modes of p
(Minka, 2005)
• Tail-adaptive f-divergence: offers a solution to the problem
that the α-divergence might cause high or infinite variance
when the distributions have fat tails (Wang, Liu, & Liu, 2018)
70/81
Alternative divergence measures in practice
• Alternative divergences can lead to better approximations, but
can be more difficult to estimate
• The practical use of alternative divergence measures in
common Bayesian frameworks like Stan and PyMC3 is limited
• The KL-divergence (i.e. ELBO) is still a very convenient
measure to calculate
71/81
Alternative divergence measures: α-divergence
• Also called Rényi divergence, it is a generalization of the
Kullback-Leibler divergence
• Dα(Q‖P) = (1 / (α − 1)) · log ∫Θ ( q(θ)^α / p(θ)^(α−1) ) dθ,
for 0 < α < ∞ and α ≠ 1
• Special cases:
• limα→1 Dα(Q‖P) = DKL(Q‖P)
• limα→0 Dα(Q‖P) = DKL(P‖Q)
• For α > 0, larger values of α enforce stronger mass-covering
properties
• In practice, large values of α lead to a very high variance if
the distribution has fat tails (Wang et al., 2018)
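A minimal numerical sketch of this divergence for two arbitrary Gaussians, evaluated on a grid for a few values of α (distributions, grid and α values are assumptions):

```python
import numpy as np
from scipy import stats

x = np.linspace(-15, 15, 40001)
q = stats.norm(0.0, 1.0).pdf(x)
p = stats.norm(1.0, 2.0).pdf(x)

def alpha_div(q, p, x, alpha):
    """Grid approximation of D_alpha(Q||P) = 1/(alpha-1) * log integral of q^alpha * p^(1-alpha) dx."""
    return np.log(np.trapz(q ** alpha * p ** (1.0 - alpha), x)) / (alpha - 1.0)

for a in [0.5, 0.9, 0.99, 2.0]:
    print(f"alpha = {a}: {alpha_div(q, p, x, a):.4f}")
# Values for alpha close to 1 approach the KL divergence D_KL(q||p)
```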
72/81
Alternative divergence measures: f -divergence
• Generalization of DKL and Dα
• Df(Q‖P) = ∫Θ f( q(θ) / p(θ) ) p(θ) dη(θ), where η(θ) is a
reference distribution over the space Θ and f is a convex
function (Sason, 2018)
• If f(t) = t log t, then Df(Q‖P) = DKL(Q‖P)
• f can also be chosen so that Df(Q‖P) = Dα(Q‖P)
• Tail-adaptive f-divergences can help solve the problems of
α-divergences
73/81
Contents
1 What is variational Bayes?
2 Divergence between distributions
3 Estimating variational approximations
4 Example with PyMC3
5 Evaluating variational approximations
6 Alternative distance measures
7 Summary
74/81
Summary
• The idea of variational inference is to find a probability
distribution that minimizes the divergence to the posterior
• The KL-divergence cannot be minimized directly, as it
depends on the intractable posterior
• Maximizing the ELBO is equivalent to minimizing the
KL-divergence
75/81
Summary
• To maximize the ELBO, we have to make assumptions
• The most common assumption is the mean-field
assumption, which treats parameters as independent
• An easy, but inefficient, algorithm to maximize the ELBO is
CAVI
• By taking a subsample of the data and then updating the global
structure, SVI is more efficient and more accurate than CAVI
76/81
Summary
• With CAVI and SVI, the model families we can use are
restricted and the ELBO has to be derived by hand
• Black-box variational inference solves these two problems
• PyMC3 and Stan use ADVI, which is based on BBVI. They
provide mean-field and full-rank alternatives
• There are alternatives to the KL-divergence, but they are not
widely used in common frameworks
77/81
Summary
What do I really need to know about
variational inference?
78/81
Twitter: @ flaviomorelli
GitHub: @flaviomorelli
LinkedIn: Flavio Morelli
Email: flavio.morelli@fu-berlin.de
79/81
Bibliography I
Blei, D. M. (2017). Variational Inference: Foundations and
Innovations. Retrieved 2019-11-30, from
https://youtu.be/Dv86zdWjJKQ
Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational
Inference: A Review for Statisticians. Journal of the
American Statistical Association, 112(518), 859–877.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet
Allocation. Journal of Machine Learning Research, 3,
993–1022.
Ioannides, A. (2018). Regression in PYMC3 using MCMC &
Variational Inference. Retrieved 2019-11-25, from
https://alexioannides.com/2018/11/07/bayesian-regression-in-pymc3-using-mcmc-variational-inference
80/81
Bibliography II
Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei,
D. M. (2017). Automatic Differentiation Variational
Inference. Journal of Machine Learning Research, 18(14),
1–45.
Minka, T. (2005). Divergence measures and message passing
(Tech. Rep. No. MSR-TR-2005-173). Microsoft Research.
Ormerod, J. T., & Wand, M. P. (2010). Explaining Variational
Approximations. The American Statistician, 64(2), 140–153.
Ranganath, R., Gerrish, S., & Blei, D. M. (2014). Black Box
Variational Inference. In Proceedings of Artificial Intelligence
and Statistics.
Robbins, H., & Monro, S. (1951). A Stochastic Approximation
Method. The Annals of Mathematical Statistics, 22(3),
400–407.
Sason, I. (2018). On f-Divergences: Integral Representations,
Local Behavior, and Inequalities. Entropy, 20(5), 383.
81/81
Bibliography III
Sia, X. Y. S. (2019). Coordinate Ascent Mean-field Variational
Inference (Univariate Gaussian Example). Retrieved
2019-11-20, from
https://suzyahyah.github.io/bayesian%20inference/machine%20learning/variational%20inference/2019/03/20/CAVI.html
Wang, D., Liu, H., & Liu, Q. (2018). Variational Inference with
Tail-adaptive f-Divergence. In Advances in Neural
Information Processing Systems.
Yao, Y., Vehtari, A., Simpson, D., & Gelman, A. (2018). Yes, but
Did It Work?: Evaluating Variational Inference. In
Proceedings of the 35th International Conference on
Machine Learning.