Uncertainties in
Deep Learning
in a nutshell
Sungjoon Choi

CPSLAB, SNU
Introduction
2
The first fatality from an assisted driving system
Introduction
3
Google Photos identified two black people as 'gorillas'
Introduction
4
Contents
5
Contents
6
- Bayesian Neural Network with variational inference and re-parametrization trick
- Bootstrapping-based uncertainty modeling
- Bayesian Neural Network modeling epistemic and aleatoric uncertainties
- Application to Safe RL
- Novelty Detection using auto-encoder
- Mixture Density Network modeling epistemic and aleatoric uncertainties
7
Y. Gal, Uncertainty in Deep Learning, 2016
Gal (2016)
8
Model uncertainty
1. Given a model trained with several pictures of dog breeds, a user asks the model
to decide on a dog breed using a photo of a cat.
Gal (2016)
9
Model uncertainty
2. We have three different types of images to classify (cat, dog, and cow), where only the cat images are noisy.
Gal (2016)
10
Model uncertainty
3. Which model parameters best explain a given dataset? What model structure should we use?
Gal (2016)
11
Model uncertainty
1. Given a model trained with several pictures of dog breeds, a user asks the model
to decide on a dog breed using a photo of a cat.
2. We have three different types of images to classify (cat, dog, and cow), where only the cat images are noisy.
3. Which model parameters best explain a given dataset? What model structure should we use?
Gal (2016)
12
Model uncertainty
1. Given a model trained with several pictures of dog breeds, a user asks the model
to decide on a dog breed using a photo of a cat.
2. We have three different types of images to classify (cat, dog, and cow), where only the cat images are noisy.
3. Which model parameters best explain a given dataset? What model structure should we use?
These correspond to (1) out-of-distribution test data, (2) aleatoric uncertainty, and (3) epistemic uncertainty.
Gal (2016)
13
Dropout as a Bayesian approximation
“We show that a neural network with arbitrary depth and non-linearities, with dropout
applied before every weight layer, is mathematically equivalent to an approximation
to a well known Bayesian model.”
Gal (2016)
14
Dropout as a Bayesian approximation
The resulting formulations are surprisingly simple.
Gal (2016)
15
Bayesian Neural Network
Posterior $p(w \mid X, Y)$, prior $p(w)$
In Bayesian inference, we aim to find the posterior distribution over the random variables of interest given a prior distribution; this posterior is intractable in many cases.
Gal (2016)
16
Bayesian Neural Network
Inference: $p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, w)\, p(w \mid X, Y)\, dw$
Posterior $p(w \mid X, Y)$, prior $p(w)$
Note that even when the posterior distribution is given, exact inference is very likely to remain intractable, as it contains an integral with respect to the distribution over the latent variables.
Gal (2016)
17
Bayesian Neural Network
Inference: $p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, w)\, p(w \mid X, Y)\, dw$
Posterior $p(w \mid X, Y)$, prior $p(w)$
Variational inference: $\mathrm{KL}\big(q_\theta(w) \,\|\, p(w \mid X, Y)\big) = \int q_\theta(w) \log \frac{q_\theta(w)}{p(w \mid X, Y)}\, dw$
Variational inference approximates the (intractable) posterior distribution with a (tractable) variational distribution by minimizing the KL divergence between them.
Gal (2016)
18
Bayesian Neural Network
Inference: $p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, w)\, p(w \mid X, Y)\, dw$
Posterior $p(w \mid X, Y)$, prior $p(w)$
Variational inference: $\mathrm{KL}\big(q_\theta(w) \,\|\, p(w \mid X, Y)\big) = \int q_\theta(w) \log \frac{q_\theta(w)}{p(w \mid X, Y)}\, dw$
ELBO: $\int q_\theta(w) \log p(Y \mid X, w)\, dw - \mathrm{KL}\big(q_\theta(w) \,\|\, p(w)\big)$
Minimizing the KL divergence is equivalent to maximizing the evidence lower bound (ELBO), which still contains an integral with respect to the distribution over the latent variables.
Gal (2016)
19
Bayesian Neural Network
Inference: $p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, w)\, p(w \mid X, Y)\, dw$
Posterior $p(w \mid X, Y)$, prior $p(w)$
Variational inference: $\mathrm{KL}\big(q_\theta(w) \,\|\, p(w \mid X, Y)\big) = \int q_\theta(w) \log \frac{q_\theta(w)}{p(w \mid X, Y)}\, dw$
ELBO: $\int q_\theta(w) \log p(Y \mid X, w)\, dw - \mathrm{KL}\big(q_\theta(w) \,\|\, p(w)\big)$
Instead of the posterior distribution, we only need the likelihood to compute the ELBO.
Gal (2016)
20
Bayesian Neural Network
Inference: $p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, w)\, p(w \mid X, Y)\, dw$
Posterior $p(w \mid X, Y)$, prior $p(w)$
Variational inference: $\mathrm{KL}\big(q_\theta(w) \,\|\, p(w \mid X, Y)\big) = \int q_\theta(w) \log \frac{q_\theta(w)}{p(w \mid X, Y)}\, dw$
ELBO: $\int q_\theta(w) \log p(Y \mid X, w)\, dw - \mathrm{KL}\big(q_\theta(w) \,\|\, p(w)\big)$
ELBO (re-parametrized): $\int p(\epsilon) \log p(Y \mid X, w)\, d\epsilon - \mathrm{KL}\big(q_\theta(w) \,\|\, p(w)\big)$, where $w = g(\theta, \epsilon)$
Gal (2016)
21
Bayesian Neural Network
ELBO: $\int q_\theta(w) \log p(Y \mid X, w)\, dw - \mathrm{KL}\big(q_\theta(w) \,\|\, p(w)\big)$
Re-parametrized ELBO: $\int p(\epsilon) \log p(Y \mid X, w)\, d\epsilon - \mathrm{KL}\big(q_\theta(w) \,\|\, p(w)\big)$, where $w = g(\theta, \epsilon)$ (re-parametrization trick)
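To make the trick concrete, here is a minimal NumPy sketch for a mean-field Gaussian $q_\theta(w) = \mathcal{N}(\mu, \sigma^2)$; the helper names (`reparam_sample`, `elbo_estimate`) and the callables `log_lik` and `kl` are illustrative assumptions, not from the slides.

```python
import numpy as np

def reparam_sample(mu, log_sigma, rng):
    """w = g(theta, eps): a deterministic transform of parameter-free noise."""
    eps = rng.standard_normal(mu.shape)      # eps ~ p(eps) = N(0, I)
    return mu + np.exp(log_sigma) * eps      # w ~ q_theta(w) = N(mu, sigma^2)

def elbo_estimate(mu, log_sigma, log_lik, kl, rng, n_samples=16):
    """Monte Carlo estimate of E_{p(eps)}[log p(Y|X, g(theta, eps))] - KL(q||p)."""
    mc = np.mean([log_lik(reparam_sample(mu, log_sigma, rng))
                  for _ in range(n_samples)])
    return mc - kl(mu, log_sigma)
```

Because the noise distribution $p(\epsilon)$ does not depend on $\theta$, the Monte Carlo estimate above can be differentiated with respect to $\mu$ and $\log\sigma$.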
Gal (2016)
22
Gaussian process approximation
MC approximation
Apply to Gaussian processes
GP Marginal likelihood
Gal (2016)
23
Bayesian Neural Network with dropout
Gal (2016)
24
Bayesian Neural Network with dropout
Likelihood: $p\big(y \mid f^{g(\theta, \hat{\epsilon})}(x)\big) = \mathcal{N}\big(y;\, \hat{y}_\theta(x),\, \tau^{-1} I_D\big)$
Gal (2016)
25
Bayesian Neural Network with dropout
Re-parametrized ELBO: $\int p(\epsilon) \log p(Y \mid X, w)\, d\epsilon - \mathrm{KL}\big(q_\theta(w) \,\|\, p(w)\big)$
(re-parametrized likelihood term and prior/KL term)
Gal (2016)
26
Bayesian Neural Network with dropout
Gal (2016)
27
Predictive mean and uncertainties
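A minimal sketch of how the predictive mean and uncertainty are typically obtained with MC dropout, assuming a PyTorch model that contains `torch.nn.Dropout` layers; the helper name is hypothetical.

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model, x, T=50):
    """Predictive mean/variance from T stochastic forward passes with dropout kept on."""
    model.eval()
    for m in model.modules():                # re-enable only the dropout layers
        if isinstance(m, torch.nn.Dropout):
            m.train()
    preds = torch.stack([model(x) for _ in range(T)])   # [T, N, D]
    # Gal's estimator additionally adds the observation-noise term tau^{-1} to the variance.
    return preds.mean(dim=0), preds.var(dim=0)
```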
28
P. McClure and N. Kriegeskorte, Representing Inferential Uncertainty in Deep
Neural Networks Through Sampling, 2017
McClure & Kriegeskorte (2017)
29
Different variational distributions
McClure & Kriegeskorte (2017)
30
Results on MNIST
Without noticeable performance degradation, the proposed methods are able to
quantify the level of uncertainty.
31
Anonymous, Bayesian Uncertainty Estimation for
Batch Normalized Deep Networks, 2018
Anonymous (2018)
32
Monte Carlo Batch Normalization (MCBN)
Anonymous (2018)
33
Batch normalized deep nets as Bayesian modeling
Learnable parameter
Stochastic parameter
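A minimal PyTorch sketch of how MCBN could be implemented, under the interpretation that re-estimating the batch-norm statistics from a randomly drawn training mini-batch at test time plays the role of sampling from $q(w)$; the helper and its arguments are assumptions, not the paper's reference code.

```python
import random
import torch

def mcbn_predict(model, x_test, train_batches, T=32):
    """Monte Carlo Batch Normalization: re-estimate BN statistics per forward pass."""
    bn_types = (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d, torch.nn.BatchNorm3d)
    for m in model.modules():
        if isinstance(m, bn_types):
            m.momentum = 1.0                   # running stats <- stats of the last batch
    preds = []
    with torch.no_grad():
        for _ in range(T):
            xb = random.choice(train_batches)  # draw a random training mini-batch
            model.train()                      # BN computes batch statistics
            model(xb)                          # store this batch's statistics
            model.eval()                       # apply the sampled statistics to x_test
            preds.append(model(x_test))
    preds = torch.stack(preds)
    return preds.mean(dim=0), preds.var(dim=0)
```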
Anonymous (2018)
34
Batch normalized deep nets as Bayesian modeling
Anonymous (2018)
35
MCBN to Bayesian SegNet
36
B. Lakshminarayanan et al., Simple and Scalable Predictive Uncertainty
Estimation using Deep Ensembles, 2017
Lakshminarayanan et al. (2017)
37
Proper scoring rule
“A scoring rule assigns a numerical score to a predictive distribution
rewarding better calibrated predictions over worse. (…) It turns out many
common neural network loss functions are proper scoring rules.”
Lakshminarayanan et al. (2017)
38
Density network
$x \;\rightarrow\; f_\theta(x) = [\mu_\theta(x),\, \sigma_\theta(x)]$
$L = -\frac{1}{N} \sum_{i=1}^{N} \log \mathcal{N}\big(y_i;\, \mu_\theta(x_i),\, \sigma_\theta^2(x_i)\big)$
Lakshminarayanan et al. (2017)
39
Density network
$L = -\frac{1}{N} \sum_{i=1}^{N} \log \mathcal{N}\big(y_i;\, \mu_\theta(x_i),\, \sigma_\theta^2(x_i)\big)$
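A minimal NumPy sketch of this Gaussian negative log-likelihood written out explicitly; the small variance offset is a common numerical stabilizer and an implementation choice, not part of the slides.

```python
import numpy as np

def gaussian_nll(y, mu, var, eps=1e-6):
    """-(1/N) sum_i log N(y_i; mu_i, var_i), written out explicitly."""
    var = var + eps                          # keep the predicted variance positive
    return np.mean(0.5 * np.log(2.0 * np.pi * var) + 0.5 * (y - mu) ** 2 / var)
```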
Lakshminarayanan et al. (2017)
40
Adversarial training with a fast gradient sign method
“Adversarial training can also be interpreted as a computationally efficient solution to smooth the predictive distributions by increasing the likelihood of the target around an ε-neighborhood of the observed training examples.”
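A minimal PyTorch sketch of the fast gradient sign method used to generate such adversarial examples; `loss_fn` and `epsilon` are assumptions.

```python
import torch

def fgsm_example(model, loss_fn, x, y, epsilon=0.01):
    """x' = x + epsilon * sign(grad_x loss): a perturbation near the training example."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    return (x_adv + epsilon * x_adv.grad.sign()).detach()
```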
Lakshminarayanan et al. (2017)
41
Proposed method
Train M different models
Lakshminarayanan et al. (2017)
42
Proposed method
Figure: empirical variance (5), density network (1), adversarial training, deep ensemble (5)
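The deep ensemble treats the M density networks as a uniformly weighted Gaussian mixture; a minimal sketch of the moment-matched combination, assuming each member returns a (mean, variance) pair.

```python
import numpy as np

def ensemble_predict(members, x):
    """Moment-match the uniform mixture of M Gaussian predictive distributions."""
    mus, vars_ = zip(*[member(x) for member in members])   # each returns (mu, var)
    mus, vars_ = np.stack(mus), np.stack(vars_)             # [M, N]
    mu = mus.mean(axis=0)
    var = (vars_ + mus ** 2).mean(axis=0) - mu ** 2
    return mu, var
```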
43
A. Kendall and Y. Gal, What Uncertainties Do We Need in
Bayesian Deep Learning for Computer Vision?, 2017
Kendall & Gal (2017)
44
Aleatoric & epistemic uncertainties
Kendall & Gal (2017)
45
Aleatoric & epistemic uncertainties
$\hat{W} \sim q(W)$
$[\hat{y}, \hat{\sigma}^2] = f^{\hat{W}}(x)$
$L = -\frac{1}{N} \sum_{i=1}^{N} \log \mathcal{N}\big(y_i;\, \hat{y}_{\hat{W}}(x_i),\, \hat{\sigma}^2_{\hat{W}}(x_i)\big)$
Kendall & Gal (2017)
46
Heteroscedastic uncertainty as loss attenuation
$\hat{W} \sim q(W)$
$[\hat{y}, \hat{\sigma}^2] = f^{\hat{W}}(x)$
Kendall & Gal (2017)
47
Aleatoric & epistemic uncertainties
$\hat{W} \sim q(W)$
$[\hat{y}, \hat{\sigma}^2] = f^{\hat{W}}(x)$
$\mathrm{Var}(y) \;\approx\; \underbrace{\frac{1}{T} \sum_{t=1}^{T} \hat{y}_t^2 - \Big(\frac{1}{T} \sum_{t=1}^{T} \hat{y}_t\Big)^2}_{\text{epistemic unct.}} \;+\; \underbrace{\frac{1}{T} \sum_{t=1}^{T} \hat{\sigma}_t^2}_{\text{aleatoric unct.}}$
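A minimal NumPy sketch of this decomposition, assuming T stochastic forward passes (e.g., via MC dropout) each returning a mean $\hat{y}_t$ and a variance $\hat{\sigma}^2_t$.

```python
import numpy as np

def decompose_uncertainty(samples):
    """samples: list of (y_hat_t, var_hat_t) pairs from T stochastic forward passes."""
    y_hat = np.stack([y for y, _ in samples])       # [T, N]
    var_hat = np.stack([v for _, v in samples])     # [T, N]
    epistemic = np.mean(y_hat ** 2, axis=0) - np.mean(y_hat, axis=0) ** 2
    aleatoric = np.mean(var_hat, axis=0)
    return epistemic, aleatoric                     # Var(y) ~ epistemic + aleatoric
```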
Kendall & Gal (2017)
48
Results
49
G. Kahn et al., Uncertainty-Aware Reinforcement Learning for Collision Avoidance, 2016
Kahn et al. (2016)
50
Uncertainty-Aware Reinforcement Learning
Uncertainty-aware collision prediction model
Kahn et al. (2016)
51
Uncertainty-Aware Reinforcement Learning
“Uncertainty is based on bootstrapped neural networks using dropout.”
Bootstrapping?

- Generate multiple datasets using sampling with replacement. 

- The intuition behind bootstrapping is that, by generating multiple populations
and training one model per population, the models will agree in high-density
areas (low uncertainty) and disagree in low-density areas (high uncertainty).
Dropout?

- “Dropout can be viewed as an economical approximation of an ensemble
method (such as bootstrapping) in which each sampled dropout mask
corresponds to a different model.”
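A minimal sketch of the bootstrapping idea above: resample the dataset with replacement B times, train one model per resample, and read ensemble disagreement as uncertainty; `train_model` is a hypothetical training routine.

```python
import numpy as np

def bootstrap_ensemble(X, Y, train_model, B=5, seed=0):
    """Train B models, each on a dataset resampled with replacement."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))   # sampling with replacement
        models.append(train_model(X[idx], Y[idx]))
    return models

def ensemble_disagreement(models, x):
    preds = np.stack([m(x) for m in models])         # [B, ...]
    return preds.mean(axis=0), preds.var(axis=0)     # disagreement ~ uncertainty
```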
Kahn et al. (2016)
52
Uncertainty-Aware Reinforcement Learning
Train B different models
53
Richter & Roy (2017)
54
Introduction
State-of-the-art deep learning methods are known to produce erratic or unsafe
predictions when faced with novel inputs. Furthermore, recent ensemble, bootstrap
and dropout methods for quantifying neural network uncertainty may not efficiently
provide accurate uncertainty estimates when queried with inputs that are very different
from their training data. 

We use a conventional feedforward neural network to predict collisions based on
images observed by the robot, and we use an autoencoder to judge whether those
images are similar enough to the training data for the resulting neural network
predictions to be trusted.
Richter & Roy (2017)
55
Novelty detection
Use the reconstruction error as a measure of novelty.
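A minimal sketch of using the autoencoder's reconstruction error as a novelty score, with a threshold calibrated on training errors; the quantile rule is an assumption, not necessarily the paper's exact procedure.

```python
import numpy as np

def novelty_score(autoencoder, x):
    """Per-example reconstruction error; a large error suggests an unfamiliar input."""
    x_rec = autoencoder(x)                                   # decode(encode(x))
    return np.mean((x - x_rec) ** 2, axis=tuple(range(1, x.ndim)))

def is_novel(autoencoder, x, train_scores, quantile=0.99):
    """Flag inputs whose error exceeds a high quantile of the training-set errors."""
    return novelty_score(autoencoder, x) > np.quantile(train_scores, quantile)
```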
Richter & Roy (2017)
56
Novelty detection
Use the reconstruction error as a measure of novelty.
Richter & Roy (2017)
57
Novelty detection
Richter & Roy (2017)
58
Novelty detection
Richter & Roy (2017)
59
Learning to predict collision
$f_c(c \mid i_t, a_t)$, $f_p(c \mid \hat{m}_t, a_t)$, $f_n(i_t)$, where
- $c$: collision
- $\hat{m}_t$: estimated map
- $i_t$: input image
- $a_t$: action
- $f_c(c \mid i_t, a_t)$: neural net trained to predict collision
- $f_p(c \mid \hat{m}_t, a_t)$: prior estimate of collision probability
- $f_n(i_t)$: novelty detection
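A minimal sketch of how the learned predictor $f_c$, the prior estimate $f_p$, and the novelty detector $f_n$ from the notation above might be combined; the hard threshold is an assumption standing in for the paper's switching rule.

```python
def collision_estimate(f_c, f_p, f_n, i_t, m_t_hat, a_t, threshold=0.5):
    """Trust the learned model on familiar inputs; fall back to the prior otherwise."""
    if f_n(i_t) > threshold:        # the autoencoder judges the image as novel
        return f_p(m_t_hat, a_t)    # conservative prior estimate of collision
    return f_c(i_t, a_t)            # learned collision prediction
```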
Richter & Roy (2017)
60
Experiments
“Using an autoencoder as a measure of uncertainty in our collision prediction
network, we can transition intelligently between the high performance of the learned
model and the safe, conservative performance of a simple prior, depending on
whether the system has been trained on the relevant data.”
Richter & Roy (2017)
61
Experiments
“In the hallway training environment, we achieved a mean speed of 3.26 m/s and a top
speed over 5.03 m/s. This result significantly exceeds the maximum speeds achieved
when driving in this environment under the prior estimate of collision probability before
performing any learning.”

“On the other hand, in the novel environment, for which our model was untrained, the
novelty detector correctly identified every image as being unfamiliar. In the novel
environment, we achieved a mean speed of 2.49 m/s and a maximum speed of 3.17 m/s.”
62
S. Choi et al., Uncertainty-Aware Learning from Demonstration Using
Mixture Density Networks with Sampling-Free Variance Modeling, 2017
All in this room!
Choi et al. (2017)
63
Mixture density networks
$x \;\rightarrow\; f^{\hat{W}}(x) = [\pi_1(x), \pi_2(x), \pi_3(x),\; \mu_1(x), \mu_2(x), \mu_3(x),\; \sigma_1(x), \sigma_2(x), \sigma_3(x)]$
$L = -\frac{1}{N} \sum_{i=1}^{N} \log \sum_{j=1}^{K} \pi_j(x_i)\, \mathcal{N}\big(y_i;\, \mu_j(x_i),\, \sigma_j^2(x_i)\big)$
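A minimal NumPy sketch of this mixture negative log-likelihood, using log-sum-exp for numerical stability (an implementation choice, not from the slides).

```python
import numpy as np

def mdn_nll(y, pi, mu, var, eps=1e-6):
    """y: [N]; pi, mu, var: [N, K] mixture weights, means, and variances."""
    var = var + eps
    log_comp = (np.log(pi + eps)
                - 0.5 * np.log(2.0 * np.pi * var)
                - 0.5 * (y[:, None] - mu) ** 2 / var)   # log pi_j + log N(y; mu_j, var_j)
    return -np.mean(np.logaddexp.reduce(log_comp, axis=1))
```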
64
Choi et al. (2017)
Mixture density networks
65
Choi et al. (2017)
Explained and unexplained variance
“We propose a sampling-free variance modeling method using a mixture
density network which can be decomposed into explained variance and
unexplained variance.”
66
Choi et al. (2017)
Explained and unexplained variance
“In particular, explained variance represents model uncertainty whereas
unexplained variance indicates the uncertainty inherent in the process, e.g.,
measurement noise.”
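At a high level, this is the law of total variance for a Gaussian mixture; a minimal NumPy sketch of the split, with notation that may differ from the paper's.

```python
import numpy as np

def mdn_variance_split(pi, mu, var):
    """pi, mu, var: [N, K]. Split the total mixture variance per input."""
    mean = np.sum(pi * mu, axis=1, keepdims=True)        # mixture mean
    explained = np.sum(pi * (mu - mean) ** 2, axis=1)    # spread of the component means
    unexplained = np.sum(pi * var, axis=1)               # mixture of component variances
    return explained, unexplained                        # total variance = sum of both
```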
67
Choi et al. (2017)
Analysis with Synthetic Examples
The proposed uncertainty modeling method is analyzed in three different
synthetic examples: 1) absence of data, 2) heavy noise, and 3) composition
of functions.
68
Choi et al. (2017)
Analysis with Synthetic Examples
69
Choi et al. (2017)
Explained and unexplained variance
We present uncertainty-aware learning from demonstration, using the explained variance as a switching criterion between the trained policy and a rule-based safe mode.
(Figure: unexplained variance vs. explained variance)
70
Choi et al. (2017)
Driving experiments
71