Training and Inference for Deep Gaussian Processes
Keyon Vafa
April 26, 2016
Motivation

An ideal model for prediction is
- accurate
- computationally efficient
- easy to tune without overfitting
- able to provide certainty estimates
Motivation

This thesis focuses on one particular class of prediction models, deep Gaussian processes for regression. They are a relatively new model, introduced by Damianou and Lawrence in 2013.

Exact inference is intractable. In this thesis, we introduce a new method to learn deep GPs, the Deep Gaussian Process Sampling (DGPS) algorithm.
Motivation

The DGPS algorithm
- is more straightforward than existing methods
- can more easily adapt to using arbitrary kernels
- relies on Monte Carlo sampling to circumvent the intractability hurdle
- uses pseudo data to ease the computational burden
Table of Contents
1 Gaussian Processes
2 Deep Gaussian Processes
3 Implementation
4 Experiments and Analysis
5 Conclusion

Gaussian Processes
Definition of a Gaussian Process

A function f is a Gaussian process (GP) if any finite set of values $f(x_1), \dots, f(x_N)$ has a multivariate normal distribution.
- The inputs $\{x_n\}_{n=1}^N$ can be vectors from any arbitrarily sized domain.
- A GP is specified by a mean function m(x) and a covariance function k(x, x'), where
$$m(x) = \mathbb{E}[f(x)], \qquad k(x, x') = \mathrm{Cov}(f(x), f(x')).$$
Covariance Function

The covariance function (or kernel) determines the smoothness and stationarity of functions drawn from a GP.

The squared exponential covariance function has the following form:
$$k(x, x') = \sigma_f^2 \exp\left(-\tfrac{1}{2}(x - x')^\top M (x - x')\right)$$
When M is a diagonal matrix, its diagonal elements are the inverse squared length-scales, $\ell_i^{-2}$; $\sigma_f^2$ is known as the signal variance.
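To make the kernel concrete, here is a minimal NumPy sketch of the squared exponential covariance above, assuming M is diagonal with entries $\ell_i^{-2}$; the helper name and default values are ours, not the thesis's code.

```python
# Minimal sketch of the squared exponential kernel (our helper, not thesis code),
# assuming M is diagonal with entries 1 / l_i^2.
import numpy as np

def squared_exponential(X1, X2, signal_variance=1.0, length_scales=1.0):
    """k(x, x') = sigma_f^2 * exp(-0.5 * (x - x')^T M (x - x'))."""
    X1 = np.atleast_2d(X1) / length_scales   # divide each dimension by its length-scale
    X2 = np.atleast_2d(X2) / length_scales
    sq_dists = (np.sum(X1 ** 2, axis=1)[:, None]
                + np.sum(X2 ** 2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return signal_variance * np.exp(-0.5 * np.maximum(sq_dists, 0.0))
```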
Sampling from a GP

[Figure: random samples from GP priors with (signal variance, length-scale) pairs (1.0, 0.2), (1.0, 1.0), (1.0, 5.0), (0.2, 1.0), (1.0, 1.0), and (5.0, 1.0).] The length-scale controls the smoothness of our function, while the signal variance controls the deviation from the mean.
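Draws like those in the figure can be reproduced, under the same assumptions, by sampling from $\mathcal{N}(0, K_{XX})$ on a dense grid; this sketch reuses the hypothetical squared_exponential helper above.

```python
# Sketch of drawing prior functions: sample from N(0, K_XX) on a grid.
# Reuses the hypothetical squared_exponential helper defined earlier.
import numpy as np

def sample_gp_prior(x_grid, signal_variance, length_scale, n_samples=3, jitter=1e-8):
    K = squared_exponential(x_grid[:, None], x_grid[:, None],
                            signal_variance, length_scale)
    K += jitter * np.eye(len(x_grid))        # jitter for numerical stability
    L = np.linalg.cholesky(K)                # K = L L^T
    z = np.random.randn(len(x_grid), n_samples)
    return L @ z                             # each column is one draw from the GP prior

x_grid = np.linspace(-5, 5, 200)
draws = sample_gp_prior(x_grid, signal_variance=1.0, length_scale=0.2)  # wiggly, short length-scale draws
```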
GPs for Regression

Setup: We are given a set of inputs $X \in \mathbb{R}^{N \times D}$ and corresponding outputs $y \in \mathbb{R}^N$, the function values from a GP evaluated at X. We assume a mean function m(x) and a covariance function k(x, x'), which rely on parameters $\theta$.

We would like to learn the optimal $\theta$, and estimate the function values $y_*$ for a set of new inputs $X_*$.

To learn $\theta$, we optimize the marginal likelihood:
$$P(y \mid X, \theta) = \mathcal{N}(y \mid 0, K_{XX}).$$
We can then use the multivariate normal conditional distribution to evaluate the predictive distribution:
$$P(y_* \mid X_*, X, y, \theta) = \mathcal{N}\left(y_* \mid K_{X_*X} K_{XX}^{-1} y,\; K_{X_*X_*} - K_{X_*X} K_{XX}^{-1} K_{XX_*}\right).$$
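A hedged sketch of these two computations, assuming noise-free observations and the hypothetical kernel interface from the earlier sketch (`theta` is a dict of kernel hyperparameters); jitter is added only for numerical stability.

```python
# Sketch of the GP marginal likelihood and posterior predictive equations above
# (noise-free observations; `kernel` and `theta` follow the earlier hypothetical helper).
import numpy as np

def log_marginal_likelihood(X, y, kernel, theta, jitter=1e-6):
    """log N(y | 0, K_XX), the quantity maximized with respect to theta."""
    K = kernel(X, X, **theta) + jitter * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K^{-1} y via Cholesky
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(X) * np.log(2.0 * np.pi))

def gp_predict(X_star, X, y, kernel, theta, jitter=1e-6):
    """Mean and covariance of P(y* | X*, X, y, theta)."""
    K_xx = kernel(X, X, **theta) + jitter * np.eye(len(X))
    K_sx = kernel(X_star, X, **theta)
    K_ss = kernel(X_star, X_star, **theta)
    mean = K_sx @ np.linalg.solve(K_xx, y)                 # K_{X*X} K_XX^{-1} y
    cov = K_ss - K_sx @ np.linalg.solve(K_xx, K_sx.T)      # K_{X*X*} - K_{X*X} K_XX^{-1} K_{XX*}
    return mean, cov
```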
GPs for Regression

Note this is all true because we are assuming the outputs correspond to a Gaussian process. We therefore make the following assumption:
$$\begin{bmatrix} y \\ y_* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} K_{XX} & K_{XX_*} \\ K_{X_*X} & K_{X_*X_*} \end{bmatrix} \right).$$
Computing $P(y \mid X)$ and $P(y_* \mid X_*, X, y)$ only requires matrix algebra on the above assumption.
Example of a GP for Regression

Figure: On the left, data from a sigmoidal curve with noise. On the right, samples from a GP trained on the data (represented by 'x'), using a squared exponential covariance function.
Deep Gaussian Processes
Definition of a Deep Gaussian Process

Formally, we define a deep Gaussian process as the composition of GPs:
$$f^{(1:L)}(x) = f^{(L)}\left(f^{(L-1)}\left(\dots f^{(2)}\left(f^{(1)}(x)\right) \dots\right)\right)$$
where $f_d^{(l)} \sim \mathcal{GP}\left(0, k_d^{(l)}(x, x')\right)$ for $f_d^{(l)} \in f^{(l)}$.
Deep GP Notation

- Each layer l consists of $D^{(l)}$ GPs, where $D^{(l)}$ is the number of units at layer l.
- For an L layer deep GP, we have
  - one input layer $x_n \in \mathbb{R}^{D^{(0)}}$
  - L − 1 hidden layers $\{h_n^{(l)}\}_{l=1}^{L-1}$
  - one output layer $y_n$, which we assume to be 1-dimensional.
- All layers are completely connected by GPs, each with their own kernel.
Example: Two-Layer Deep GP

[Graphical model: $x_n \to h_n \to y_n$, with GP f mapping $x_n$ to $h_n$ and GP g mapping $h_n$ to $y_n$.]

We have a one dimensional input, $x_n$, a one dimensional hidden unit, $h_n$, and a one dimensional output, $y_n$. This two-layer network consists of two GPs, f and g, where
$$h_n = f(x_n), \quad f \sim \mathcal{GP}\left(0, k^{(1)}(x, x')\right)$$
and
$$y_n = g(h_n), \quad g \sim \mathcal{GP}\left(0, k^{(2)}(h, h')\right).$$
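As an illustration, forward sampling from this two-layer construction only requires drawing from one GP prior and feeding the draw into another. The sketch below reuses the hypothetical sample_gp_prior helper from earlier, with length-scales matching the deep GP sampling figure later in the deck (0.5 for layer 1, 1.0 for layer 2).

```python
# Illustrative forward sampling from the two-layer composition y_n = g(f(x_n)),
# reusing the hypothetical sample_gp_prior helper defined earlier.
import numpy as np

x = np.linspace(-6, 6, 300)
h = sample_gp_prior(x, signal_variance=1.0, length_scale=0.5, n_samples=1)[:, 0]  # h_n = f(x_n)
y = sample_gp_prior(h, signal_variance=1.0, length_scale=1.0, n_samples=1)[:, 0]  # y_n = g(h_n)
```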
Example: More Complicated Model

[Graphical representation of a more complicated deep GP architecture: inputs $x_1, x_2, x_3$, two hidden layers $h_1^{(1)}, \dots, h_4^{(1)}$ and $h_1^{(2)}, \dots, h_4^{(2)}$, and output y.] Every edge corresponds to a GP between units, as the outputs of each layer are the inputs of the following layer. Our input data is 3-dimensional, while the two hidden layers in this model each have 4 hidden units.
Sampling From a Deep GP

[Figure: samples from a deep GP, showing the full composition g(f(x)) alongside Layer 1 (length-scale 0.5) and Layer 2 (length-scale 1.0).] As opposed to single-layer GPs, a deep GP can model non-stationary functions (functions whose shape changes along the input space) without the use of a non-stationary kernel.
Comparison with Neural Networks

- Similarities: deep architectures, completely connected, single-layer GPs correspond to two-layer neural networks with random weights and infinitely many hidden units.
- Differences: deep GP is nonparametric, no activation functions, must specify kernels, training is intractable.
Implementation
FITC Approximation for Single-Layer GP

- The Fully Independent Training Conditional (FITC) approximation circumvents the $O(N^3)$ training time for a single-layer GP by introducing pseudo data, points that are not in the data set but can be chosen to approximate the function (Snelson and Ghahramani, 2005).
- We introduce M pseudo inputs $\bar{X} = \{\bar{x}_m\}_{m=1}^M$ and corresponding pseudo outputs $\bar{y} = \{\bar{y}_m\}_{m=1}^M$, the function values at the pseudo inputs.
- Key assumption: conditioned on the pseudo data, the output values are independent.
FITC Approximation for Single-Layer GP

We assume a prior
$$P(\bar{y} \mid \bar{X}) = \mathcal{N}\left(0, K_{\bar{X}\bar{X}}\right).$$
Training takes time $O(NM^2)$, and testing requires $O(M^2)$.
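The FITC-style predictive used throughout the rest of the deck reduces to a conditional mean through the pseudo data plus a diagonal covariance. Below is a minimal sketch, not the thesis implementation, assuming the hypothetical kernel interface from earlier; it mirrors the $\mu^{(1)}, \Sigma^{(1)}$ equations that appear later in the deck.

```python
# Minimal sketch (not the thesis implementation) of a FITC-style predictive:
# the mean passes through the pseudo data and the covariance is kept diagonal.
import numpy as np

def fitc_predict(X, X_bar, y_bar, kernel, theta, jitter=1e-6):
    K_mm = kernel(X_bar, X_bar, **theta) + jitter * np.eye(len(X_bar))
    K_nm = kernel(X, X_bar, **theta)
    K_nn_diag = np.diag(kernel(X, X, **theta))
    mean = K_nm @ np.linalg.solve(K_mm, y_bar)        # K_XXbar K_XbarXbar^{-1} ybar
    A = np.linalg.solve(K_mm, K_nm.T)                 # K_XbarXbar^{-1} K_XbarX
    var = K_nn_diag - np.sum(K_nm * A.T, axis=1)      # diag(K_XX - K_XXbar K_XbarXbar^{-1} K_XbarX)
    return mean, np.maximum(var, 0.0)
```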
FITC Example

Figure: The predictive mean of a GP trained on sigmoidal data using the FITC approximation. On the left, we use 5 pseudo data points, while on the right, we use 10.
Learning Deep GPs is Intractable

[Graphical model: $x_n \to h_n \to y_n$, with f governed by $\theta^{(1)}$ and g by $\theta^{(2)}$.]

Example: two-layer model, with inputs X, outputs y, and hidden layer H (which is N × 1). Ideally, a Bayesian treatment would allow us to integrate out the hidden function values to evaluate
$$P(y \mid X, \theta) = \int P\left(y \mid H, \theta^{(2)}\right) P\left(H \mid X, \theta^{(1)}\right) dH = \int \mathcal{N}\left(y \mid 0, K_{HH}\right) \mathcal{N}\left(H \mid 0, K_{XX}\right) dH.$$
Evaluating this integral is intractable: H enters nonlinearly through the kernel matrix $K_{HH}$, so the Gaussians cannot be combined analytically.
DGPS Algorithm Overview

The Deep Gaussian Process Sampling algorithm relies on two central ideas:
- We sample predictive means and covariances to approximate the marginal likelihood, relying on automatic differentiation techniques to evaluate the gradients and optimize our objective.
- We replace every GP with the FITC GP, so the time complexity for L layers and H hidden units per layer is $O(N^2MLH)$ as opposed to $O(N^3LH)$.
Related Work

- Damianou and Lawrence (2013) also use the FITC approximation at every layer, but they perform inference with approximate variational marginalization. Subsequent methods (Hensman and Lawrence, 2014; Dai et al., 2015; Bui et al., 2016) also use variational approximations.
- These methods are able to integrate out the pseudo outputs at each layer, but they rely on integral approximations that restrict the kernel. Meanwhile, the DGPS uses Monte Carlo sampling, which is easier to implement, more intuitive to understand, and can extend easily to most kernels.
Sampling Hidden Values

- For inputs X, we calculate the predictive mean and covariance for every unit in the first hidden layer. We then sample values from each predictive distribution.
- For every hidden layer thereafter, we take the samples from the previous layer, calculate the predictive mean and covariance, and repeat sampling until the final layer (a minimal sketch of this loop appears below).
- We use K different samples $\{(\tilde{\mu}_k, \tilde{\Sigma}_k)\}_{k=1}^K$ to approximate the marginal likelihood:
$$P(y \mid X) \approx \frac{1}{K}\sum_{k=1}^K P\left(y \mid \tilde{\mu}_k, \tilde{\Sigma}_k\right) = \frac{1}{K}\sum_{k=1}^K \mathcal{N}\left(y \mid \tilde{\mu}_k, \tilde{\Sigma}_k\right)$$
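A minimal sketch of the layer-by-layer propagation described above. It assumes the hypothetical fitc_predict helper from earlier and a `params` structure holding, for every layer and unit, pseudo inputs, pseudo outputs, and kernel hyperparameters (one possible layout for that structure is sketched after the next slide group); none of these names come from the thesis.

```python
# Minimal sketch of layer-by-layer sampling: `params` is a list of layers, each a
# list of per-unit dicts with keys "X_bar", "y_bar", and "theta"; fitc_predict is
# the hypothetical helper sketched earlier. X is assumed to have shape (N, D).
import numpy as np

def sample_forward(X, params, kernel):
    """Propagate one Monte Carlo sample through every layer.

    Returns the predictive mean and diagonal variance at the output layer,
    assuming the final layer has a single unit (1-dimensional output).
    """
    current = X
    for layer in params:
        outputs = []
        for unit in layer:
            mu, var = fitc_predict(current, unit["X_bar"], unit["y_bar"], kernel, unit["theta"])
            outputs.append(mu + np.sqrt(var) * np.random.randn(len(mu)))  # sample this unit
        current = np.column_stack(outputs)   # samples become inputs to the next layer
    return mu, var
```

Repeating `sample_forward` K times and averaging $\mathcal{N}(y \mid \tilde{\mu}_k, \tilde{\Sigma}_k)$ over the K draws gives the Monte Carlo approximation above.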
FITC for Deep GPs

- To make fitting more scalable, we replace every GP in the model with a FITC GP.
- For each GP, corresponding to hidden unit d in layer l, we introduce pseudo inputs $\bar{X}_d^{(l)}$ and corresponding pseudo outputs $\bar{y}_d^{(l)}$.
- With the addition of the pseudo data, we are required to learn the following set of parameters (one possible container for them is sketched below):
$$\Theta = \left\{ \left\{ \bar{X}_d^{(l)}, \bar{y}_d^{(l)}, \theta_d^{(l)} \right\}_{d=1}^{D^{(l)}} \right\}_{l=1}^{L}.$$
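One possible, purely illustrative layout for $\Theta$, initializing pseudo data and kernel hyperparameters for every unit; nothing about this container or its initialization comes from the thesis itself.

```python
# Illustrative container for the parameter set Theta: pseudo inputs, pseudo outputs,
# and kernel hyperparameters for every unit d in every layer l (an assumption, not
# the thesis's data structure).
import numpy as np

def init_params(layer_dims, n_pseudo, seed=0):
    """layer_dims = [D(0), D(1), ..., D(L)]; returns params[l-1][d] for layer l, unit d."""
    rng = np.random.default_rng(seed)
    params = []
    for l in range(1, len(layer_dims)):
        layer = []
        for _ in range(layer_dims[l]):
            layer.append({
                "X_bar": rng.standard_normal((n_pseudo, layer_dims[l - 1])),  # pseudo inputs
                "y_bar": rng.standard_normal(n_pseudo),                        # pseudo outputs
                "theta": {"signal_variance": 1.0, "length_scales": 1.0},       # kernel parameters
            })
        params.append(layer)
    return params
```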
Example: DGPS Algorithm on 2 Layers

[Graphical model: $X_n \to H_n \to y_n$, with f governed by $(\bar{X}^{(1)}, \bar{y}^{(1)}, \theta^{(1)})$ and g by $(\bar{X}^{(2)}, \bar{y}^{(2)}, \theta^{(2)})$.]

Our goal is to learn
- $\{(\bar{X}^{(l)}, \bar{y}^{(l)})\}_{l=1}^{2}$, the pseudo data for each layer
- $\theta^{(1)}$ and $\theta^{(2)}$, the kernel parameters for f and g
Example: DGPS Algorithm on 2 Layers

To sample values H from the hidden layer, we use the FITC approximation and assume
$$P\left(H \mid X, \bar{X}^{(1)}, \bar{y}^{(1)}\right) = \mathcal{N}\left(\mu^{(1)}, \Sigma^{(1)}\right)$$
where
$$\mu^{(1)} = K_{X\bar{X}^{(1)}} K_{\bar{X}^{(1)}\bar{X}^{(1)}}^{-1} \bar{y}^{(1)}, \qquad \Sigma^{(1)} = \mathrm{diag}\left(K_{XX} - K_{X\bar{X}^{(1)}} K_{\bar{X}^{(1)}\bar{X}^{(1)}}^{-1} K_{\bar{X}^{(1)}X}\right).$$
We obtain K samples, $\{\tilde{H}_k\}_{k=1}^K$, from the above distribution.
Example: DGPS Algorithm on 2 Layers

For each sample $\tilde{H}_k$, we can approximate
$$P\left(y \mid \tilde{H}_k, \bar{X}^{(2)}, \bar{y}^{(2)}\right) \approx \mathcal{N}\left(\tilde{\mu}^{(2)}, \tilde{\Sigma}^{(2)}\right)$$
where
$$\tilde{\mu}^{(2)} = K_{\tilde{H}_k\bar{X}^{(2)}} K_{\bar{X}^{(2)}\bar{X}^{(2)}}^{-1} \bar{y}^{(2)}, \qquad \tilde{\Sigma}^{(2)} = \mathrm{diag}\left(K_{\tilde{H}_k\tilde{H}_k} - K_{\tilde{H}_k\bar{X}^{(2)}} K_{\bar{X}^{(2)}\bar{X}^{(2)}}^{-1} K_{\bar{X}^{(2)}\tilde{H}_k}\right).$$
Example: DGPS Algorithm on 2 Layers

Thus, we can approximate the marginal likelihood with our samples:
$$P(y \mid X, \Theta) \approx \frac{1}{K}\sum_{k=1}^K P\left(y \mid \tilde{H}_k, \bar{X}^{(2)}, \bar{y}^{(2)}\right).$$
Incorporating the prior over the pseudo outputs into our objective, we have:
$$\mathcal{L}(y \mid X, \Theta) = \log P(y \mid X, \Theta) + \sum_{l=1}^{L}\sum_{d=1}^{D^{(l)}} \log P\left(\bar{y}_d^{(l)} \mid \bar{X}_d^{(l)}\right).$$
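Putting the pieces together, here is a hedged sketch of the objective $\mathcal{L}(y \mid X, \Theta)$: the log of the Monte Carlo average plus the log prior over the pseudo outputs at every unit. It reuses the hypothetical sample_forward helper and kernel interface from the earlier sketches, none of which are the thesis's own code.

```python
# Hedged sketch of the objective L(y | X, Theta): log of the Monte Carlo average of
# N(y | mu_k, Sigma_k) plus the log prior over the pseudo outputs at every unit.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_prior_pseudo_outputs(params, kernel, jitter=1e-6):
    """sum over layers l and units d of log N(y_bar_d^(l) | 0, K_XbarXbar)."""
    total = 0.0
    for layer in params:
        for unit in layer:
            m = len(unit["y_bar"])
            K = kernel(unit["X_bar"], unit["X_bar"], **unit["theta"]) + jitter * np.eye(m)
            total += multivariate_normal.logpdf(unit["y_bar"], mean=np.zeros(m), cov=K)
    return total

def objective(params, X, y, kernel, n_mc=10):
    log_terms = []
    for _ in range(n_mc):
        mu, var = sample_forward(X, params, kernel)  # hypothetical helper from earlier
        cov = np.diag(var) + 1e-8 * np.eye(len(y))   # small jitter keeps the covariance nonsingular
        log_terms.append(multivariate_normal.logpdf(y, mean=mu, cov=cov))
    log_lik = logsumexp(log_terms) - np.log(n_mc)    # log of the Monte Carlo average
    return log_lik + log_prior_pseudo_outputs(params, kernel)
```

In the DGPS itself this quantity would be written against an automatic-differentiation library so that its gradient with respect to every element of $\Theta$ (pseudo data and kernel parameters) can drive the optimization; the plain NumPy/SciPy calls above are only meant to make the objective being optimized explicit.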
Experiments and Analysis
Step Function

- We test on a step function with noise: $X \in [-2, 2]$, $y_i = \mathrm{sign}(x_i) + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, 0.01)$ (a sketch of this data-generating step appears below).
- The non-stationarity of a step function is appealing from a deep GP perspective.
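For concreteness, one way (our assumption, including reading 0.01 as the noise variance) to generate the noisy step data and the 80/20 train/test split used later in the experiments:

```python
# Sketch of the noisy step function data and an 80/20 train/test split
# (helper name, seed, and uniform input placement are our choices).
import numpy as np

def make_step_data(n_points, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-2, 2, size=n_points)
    y = np.sign(x) + rng.normal(0.0, np.sqrt(0.01), size=n_points)  # eps_i ~ N(0, 0.01)
    order = rng.permutation(n_points)
    split = int(0.8 * n_points)
    train, test = order[:split], order[split:]
    return (x[train], y[train]), (x[test], y[test])

(train_x, train_y), (test_x, test_y) = make_step_data(100)
```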
Step Function

Figure: Functions sampled from a single-layer GP. Evidently, the predictive draws do not fully capture the shape of the step function.
Figure: Predictive draws from a single-layer GP and a two-layer deep GP.
Figure: Predictive draws from a three-layer deep GP.
Step Function

[Figure: impact of parameter initializations on predictive draws, comparing a random initialization with a smart initialization (predictive draws and learned hidden values for each).]
Step Function

Figure: Experimental results measuring the test log-likelihood per data point and test mean squared error on the noisy step function. We vary the number of layers used in the model (1, 2, or 3), along with the number of data points used in the original step function (50, 100, or 200, divided 80/20 into train/test). We run 10 trials at each combination.
Step Function

Occasionally, models with deeper architectures outperform those that are more shallow, yet they also possess the widest distributions and trials with the worst results.

Figure: Test set log-likelihoods per data point and mean squared errors plotted against their training set counterparts for the step function experiment, for 1-, 2-, and 3-layer models. Overfitting does not appear to be a problem.
Step Function

- Overfitting does not appear to be a problem.
- If we can successfully optimize our objective, deeper architectures are better suited to learning the noisy step function than shallower ones.
- However, it becomes more difficult to train and successfully optimize as the number of layers grows and the number of parameters increases.
Step Function

Figure: Predictive draws from two identical three-layer models trained with different random parameter initializations (random seeds 66 and 0), along with the learned hidden values at each layer.
Step Function

Ways to combat optimization challenges:
- Using random restarts
- Decreasing the number of model parameters
- Trying different optimization methods
- Experimenting with more diverse architectures, e.g. increasing the dimension of the hidden layer
Toy Non-Stationary Data

- We create toy non-stationary data to evaluate a deep GP's ability to learn a non-stationary function.
- We divide the input space into three regions: $X_1 \in [-4, -3]$, $X_2 \in [-1, 1]$, and $X_3 \in [2, 4]$, each of which consists of 40 data points.
- We sample from a GP with length-scale $\ell = 0.25$ for regions $X_1$ and $X_3$, and $\ell = 2$ for region $X_2$ (a sketch of this construction appears below).
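A sketch of this construction, reusing the hypothetical sample_gp_prior helper from earlier; the seed and the uniform placement of inputs within each region are our choices.

```python
# Sketch of the toy non-stationary data: 40 points per region, each region drawn
# from a GP prior with the stated length-scale.
import numpy as np

np.random.seed(0)
regions = [(-4.0, -3.0, 0.25), (-1.0, 1.0, 2.0), (2.0, 4.0, 0.25)]  # (low, high, length-scale)
xs, ys = [], []
for lo, hi, length_scale in regions:
    x = np.sort(np.random.uniform(lo, hi, size=40))
    y = sample_gp_prior(x, signal_variance=1.0, length_scale=length_scale, n_samples=1)[:, 0]
    xs.append(x)
    ys.append(y)
x_data, y_data = np.concatenate(xs), np.concatenate(ys)
```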
Toy Non-Stationary Data

Figure: Predictive draws from the single-layer and 2-layer models for the toy non-stationary data with squared exponential kernels (the 2-layer panel also shows the learned hidden values).
Toy Non-Stationary Data

Figure: The optimization for a 3-layer model can get stuck in a local optimum, and although the predictive draws are non-stationary, our predictions are poor at the tails.
Motorcycle Data

- 94 points, where the inputs are time in milliseconds since impact of a motorcycle accident and the outputs are the corresponding helmet accelerations.
- The dataset is somewhat non-stationary, as the accelerations are roughly constant early on but become more variable after a certain time.
Motorcycle Data

Figure: Predictive draws from the single-layer and 2-layer models trained on the motorcycle data with squared exponential kernels (the 2-layer panel also shows the learned hidden values).
Conclusion
Future Directions

Natural extensions include
- Trying different optimization methods to avoid getting stuck in local optima
- Introducing variational parameters so we do not have to learn pseudo outputs
- Extending the model to classification
- Exploring properties of more complex architectures, and evaluating the model likelihood to choose the optimal configuration
Acknowledgments

A huge thank-you to Sasha Rush, Finale Doshi-Velez, David Duvenaud, and Miguel Hernández-Lobato. This thesis would not have been possible without your help and support!
More Related Content

PPTX
Beta distribution and Dirichlet distribution (ベータ分布とディリクレ分布)
PDF
파이썬으로 익히는 딥러닝 기본 (18년)
PDF
(DL hacks輪読) Deep Kernel Learning
PDF
Chapter13.2.3
PPTX
JSX 速さの秘密 - 高速なJavaScriptを書く方法
PDF
Lifted-ElGamal暗号を用いた任意関数演算の二者間秘密計算プロトコルのmaliciousモデルにおける効率化
PPTX
冗長変換とその画像復元応用
PDF
連続変量を含む相互情報量の推定
Beta distribution and Dirichlet distribution (ベータ分布とディリクレ分布)
파이썬으로 익히는 딥러닝 기본 (18년)
(DL hacks輪読) Deep Kernel Learning
Chapter13.2.3
JSX 速さの秘密 - 高速なJavaScriptを書く方法
Lifted-ElGamal暗号を用いた任意関数演算の二者間秘密計算プロトコルのmaliciousモデルにおける効率化
冗長変換とその画像復元応用
連続変量を含む相互情報量の推定

What's hot (20)

PDF
MLaPP 24章 「マルコフ連鎖モンテカルロ法 (MCMC) による推論」
PDF
[DL輪読会]Deep Neural Networks as Gaussian Processes
PPTX
PRMLrevenge_3.3
PDF
3.3節 変分近似法(前半)
PDF
Prml 4.1.1
PDF
강화학습 알고리즘의 흐름도 Part 2
PDF
Ορισμένο ολοκλήρωμα με 918 ασκήσεις
PDF
Graphic Notes on Linear Algebra and Data Science
PDF
παράγωγοι β' (2013)
PPTX
R6パッケージの紹介―機能と実装
PDF
PDF
機械学習を使った時系列売上予測
PDF
Wasserstein GAN 수학 이해하기 I
PDF
深層学習によるポアソンデノイジング: 残差学習はポアソンノイズに対して有効か? 論文 Poisson Denoising by Deep Learnin...
PDF
【輪読】Bayesian Optimization of Combinatorial Structures
PPTX
RとPythonを比較する
PDF
確率的主成分分析
PPTX
クラシックな機械学習の入門  5. サポートベクターマシン
PPTX
色々な確率分布とその応用
PDF
POLINOMIOS CICLOTÓMICOS EN CUERPOS K[X] Y RAICES PRIMITIVAS MÓDULO N
MLaPP 24章 「マルコフ連鎖モンテカルロ法 (MCMC) による推論」
[DL輪読会]Deep Neural Networks as Gaussian Processes
PRMLrevenge_3.3
3.3節 変分近似法(前半)
Prml 4.1.1
강화학습 알고리즘의 흐름도 Part 2
Ορισμένο ολοκλήρωμα με 918 ασκήσεις
Graphic Notes on Linear Algebra and Data Science
παράγωγοι β' (2013)
R6パッケージの紹介―機能と実装
機械学習を使った時系列売上予測
Wasserstein GAN 수학 이해하기 I
深層学習によるポアソンデノイジング: 残差学習はポアソンノイズに対して有効か? 論文 Poisson Denoising by Deep Learnin...
【輪読】Bayesian Optimization of Combinatorial Structures
RとPythonを比較する
確率的主成分分析
クラシックな機械学習の入門  5. サポートベクターマシン
色々な確率分布とその応用
POLINOMIOS CICLOTÓMICOS EN CUERPOS K[X] Y RAICES PRIMITIVAS MÓDULO N
Ad

Viewers also liked (20)

PDF
PCAの最終形態GPLVMの解説
PPT
Pasolli_TH1_T09_2.ppt
PDF
Bird’s-eye view of Gaussian harmonic analysis
PPTX
The Role Of Translators In MT: EU 2010
PDF
Flexible and efficient Gaussian process models for machine ...
PDF
1 factor vs.2 factor gaussian model for zero coupon bond pricing final
PDF
YSC 2013
PDF
03 the gaussian kernel
PPTX
Kernal methods part2
PDF
Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks
PDF
Differentiating the translation process: A corpus analysis of editorial influ...
PDF
Inventory
PPTX
Gaussian model (kabani & sumeet)
PPTX
linear equation and gaussian elimination
PDF
Gaussian Processes: Applications in Machine Learning
PDF
The Gaussian Process Latent Variable Model (GPLVM)
PPTX
Image encryption and decryption
PPTX
Noise filtering
KEY
Google I/O 2011, Android Accelerated Rendering
PDF
Social Network Analysis & an Introduction to Tools
PCAの最終形態GPLVMの解説
Pasolli_TH1_T09_2.ppt
Bird’s-eye view of Gaussian harmonic analysis
The Role Of Translators In MT: EU 2010
Flexible and efficient Gaussian process models for machine ...
1 factor vs.2 factor gaussian model for zero coupon bond pricing final
YSC 2013
03 the gaussian kernel
Kernal methods part2
Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks
Differentiating the translation process: A corpus analysis of editorial influ...
Inventory
Gaussian model (kabani & sumeet)
linear equation and gaussian elimination
Gaussian Processes: Applications in Machine Learning
The Gaussian Process Latent Variable Model (GPLVM)
Image encryption and decryption
Noise filtering
Google I/O 2011, Android Accelerated Rendering
Social Network Analysis & an Introduction to Tools
Ad

Similar to Training and Inference for Deep Gaussian Processes (20)

PPTX
Gaussian processing
PDF
proposal_pura
PDF
A nonlinear approximation of the Bayesian Update formula
PPTX
Bayesian Inference : Kalman filter 에서 Optimization 까지 - 김홍배 박사님
PDF
PDF
MLHEP 2015: Introductory Lecture #1
PDF
Statement of stochastic programming problems
PDF
A brief introduction to Gaussian process
PDF
Expectation propagation
PDF
Hands-on Tutorial of Machine Learning in Python
PDF
Elementary Landscape Decomposition of the Hamiltonian Path Optimization Problem
PDF
QMC: Transition Workshop - Approximating Multivariate Functions When Function...
PDF
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
PDF
Stochastic Approximation and Simulated Annealing
PDF
Knowledge extraction from support vector machines
PDF
Iclr2016 vaeまとめ
PDF
Logistic Regression, Linear and Quadratic Discriminant Analyses, and KNN
PDF
Delayed acceptance for Metropolis-Hastings algorithms
PDF
QTML2021 UAP Quantum Feature Map
Gaussian processing
proposal_pura
A nonlinear approximation of the Bayesian Update formula
Bayesian Inference : Kalman filter 에서 Optimization 까지 - 김홍배 박사님
MLHEP 2015: Introductory Lecture #1
Statement of stochastic programming problems
A brief introduction to Gaussian process
Expectation propagation
Hands-on Tutorial of Machine Learning in Python
Elementary Landscape Decomposition of the Hamiltonian Path Optimization Problem
QMC: Transition Workshop - Approximating Multivariate Functions When Function...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
Stochastic Approximation and Simulated Annealing
Knowledge extraction from support vector machines
Iclr2016 vaeまとめ
Logistic Regression, Linear and Quadratic Discriminant Analyses, and KNN
Delayed acceptance for Metropolis-Hastings algorithms
QTML2021 UAP Quantum Feature Map

Recently uploaded (20)

PPTX
IBA_Chapter_11_Slides_Final_Accessible.pptx
PPTX
Qualitative Qantitative and Mixed Methods.pptx
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
PPTX
Supervised vs unsupervised machine learning algorithms
PPTX
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
PPTX
Business Acumen Training GuidePresentation.pptx
PPTX
oil_refinery_comprehensive_20250804084928 (1).pptx
PPTX
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
PPTX
Introduction to Knowledge Engineering Part 1
PDF
annual-report-2024-2025 original latest.
PPTX
climate analysis of Dhaka ,Banglades.pptx
PPTX
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
PDF
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
PPTX
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PPTX
Introduction to machine learning and Linear Models
PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
PDF
Mega Projects Data Mega Projects Data
PDF
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
PDF
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
IBA_Chapter_11_Slides_Final_Accessible.pptx
Qualitative Qantitative and Mixed Methods.pptx
Business Ppt On Nestle.pptx huunnnhhgfvu
Supervised vs unsupervised machine learning algorithms
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
Business Acumen Training GuidePresentation.pptx
oil_refinery_comprehensive_20250804084928 (1).pptx
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
Introduction to Knowledge Engineering Part 1
annual-report-2024-2025 original latest.
climate analysis of Dhaka ,Banglades.pptx
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
Data_Analytics_and_PowerBI_Presentation.pptx
Introduction to machine learning and Linear Models
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
Mega Projects Data Mega Projects Data
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf

Training and Inference for Deep Gaussian Processes

  • 1. Training and Inference for Deep Gaussian Processes Keyon Vafa April 26, 2016 Keyon Vafa Training and Inference for Deep GPs April 26, 2016 1 / 50
  • 2. Motivation An ideal model for prediction is Keyon Vafa Training and Inference for Deep GPs April 26, 2016 2 / 50
  • 3. Motivation An ideal model for prediction is accurate Keyon Vafa Training and Inference for Deep GPs April 26, 2016 2 / 50
  • 4. Motivation An ideal model for prediction is accurate computationally efficient Keyon Vafa Training and Inference for Deep GPs April 26, 2016 2 / 50
  • 5. Motivation An ideal model for prediction is accurate computationally efficient easy to tune without overfitting Keyon Vafa Training and Inference for Deep GPs April 26, 2016 2 / 50
  • 6. Motivation An ideal model for prediction is accurate computationally efficient easy to tune without overfitting able to provide certainty estimates Keyon Vafa Training and Inference for Deep GPs April 26, 2016 2 / 50
  • 7. Motivation This thesis focuses on one particular class of prediction models, deep Gaussian processes for regression. They are a new model, having been introduced by Damianou and Lawrence in 2013. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 3 / 50
  • 8. Motivation This thesis focuses on one particular class of prediction models, deep Gaussian processes for regression. They are a new model, having been introduced by Damianou and Lawrence in 2013. Exact inference is intractable. In this thesis, we introduce a new method to learn deep GPs, the Deep Gaussian Process Sampling algorithm (DPGS). Keyon Vafa Training and Inference for Deep GPs April 26, 2016 3 / 50
  • 9. Motivation The DGPS algorithm Keyon Vafa Training and Inference for Deep GPs April 26, 2016 4 / 50
  • 10. Motivation The DGPS algorithm is more straightforward than existing methods Keyon Vafa Training and Inference for Deep GPs April 26, 2016 4 / 50
  • 11. Motivation The DGPS algorithm is more straightforward than existing methods can more easily adapt to using arbitrary kernels Keyon Vafa Training and Inference for Deep GPs April 26, 2016 4 / 50
  • 12. Motivation The DGPS algorithm is more straightforward than existing methods can more easily adapt to using arbitrary kernels relies on Monte Carlo sampling to circumvent the intractability hurdle Keyon Vafa Training and Inference for Deep GPs April 26, 2016 4 / 50
  • 13. Motivation The DGPS algorithm is more straightforward than existing methods can more easily adapt to using arbitrary kernels relies on Monte Carlo sampling to circumvent the intractability hurdle uses pseudo data to ease the computational burden Keyon Vafa Training and Inference for Deep GPs April 26, 2016 4 / 50
  • 14. Gaussian Processes Table of Contents 1 Gaussian Processes 2 Deep Gaussian Processes 3 Implementation 4 Experiments and Analysis 5 Conclusion Keyon Vafa Training and Inference for Deep GPs April 26, 2016 5 / 50
  • 15. Gaussian Processes Definition of a Gaussian Process A function f is a Gaussian process (GP) if any finite set of values f (x1), . . . , f (xN) has a multivariate normal distribution. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 6 / 50
  • 16. Gaussian Processes Definition of a Gaussian Process A function f is a Gaussian process (GP) if any finite set of values f (x1), . . . , f (xN) has a multivariate normal distribution. The inputs {xn}N n=1 can be vectors from any arbitrary sized domain. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 6 / 50
  • 17. Gaussian Processes Definition of a Gaussian Process A function f is a Gaussian process (GP) if any finite set of values f (x1), . . . , f (xN) has a multivariate normal distribution. The inputs {xn}N n=1 can be vectors from any arbitrary sized domain. Specified by a mean function m(x) and a covariance function k(x, x ) where m(x) = E[f (x)] k(x, x ) = Cov(f (x), f (x )). Keyon Vafa Training and Inference for Deep GPs April 26, 2016 6 / 50
  • 18. Gaussian Processes Covariance Function The covariance function (or kernel) determines the smoothness and stationarity of functions drawn from a GP. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 7 / 50
  • 19. Gaussian Processes Covariance Function The covariance function (or kernel) determines the smoothness and stationarity of functions drawn from a GP. The squared exponential covariance function has the following form: k(x, x ) = σ2 f exp − 1 2 (x − x )T M(x − x ) Keyon Vafa Training and Inference for Deep GPs April 26, 2016 7 / 50
  • 20. Gaussian Processes Covariance Function The covariance function (or kernel) determines the smoothness and stationarity of functions drawn from a GP. The squared exponential covariance function has the following form: k(x, x ) = σ2 f exp − 1 2 (x − x )T M(x − x ) When M is a diagonal matrix, the elements on the diagonal are known as the length-scales, denoted by l−2 i . σ2 f is known as the signal variance. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 7 / 50
  • 21. Gaussian Processes Sampling from a GP x f(x) Signal variance 1.0, Length-scale 0.2 x f(x) Signal variance 1.0, Length-scale 1.0 x f(x) Signal variance 1.0, Length-scale 5.0 x f(x) Signal variance 0.2, Length-scale 1.0 x f(x) Signal variance 1.0, Length-scale 1.0 x f(x) Signal variance 5.0, Length-scale 1.0 Random samples from GP priors. The length-scale controls the smoothness of our function, while the signal variance controls the deviation from the mean. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 8 / 50
  • 22. Gaussian Processes GPs for Regression Setup: We are given a set of inputs X ∈ RN×D and corresponding outputs y ∈ RN, the function values from a GP evaluated at X. We assume a mean function m(x) and a covariance function k(x, x ), which rely on parameters θ. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 9 / 50
  • 23. Gaussian Processes GPs for Regression Setup: We are given a set of inputs X ∈ RN×D and corresponding outputs y ∈ RN, the function values from a GP evaluated at X. We assume a mean function m(x) and a covariance function k(x, x ), which rely on parameters θ. We would like to learn the optimal θ, and estimate the function values y∗ for a set of new inputs X∗. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 9 / 50
  • 24. Gaussian Processes GPs for Regression Setup: We are given a set of inputs X ∈ RN×D and corresponding outputs y ∈ RN, the function values from a GP evaluated at X. We assume a mean function m(x) and a covariance function k(x, x ), which rely on parameters θ. We would like to learn the optimal θ, and estimate the function values y∗ for a set of new inputs X∗. To learn θ, we optimize the marginal likelihood: P(y|X, θ) = N(0, KXX ). Keyon Vafa Training and Inference for Deep GPs April 26, 2016 9 / 50
  • 25. Gaussian Processes GPs for Regression Setup: We are given a set of inputs X ∈ RN×D and corresponding outputs y ∈ RN, the function values from a GP evaluated at X. We assume a mean function m(x) and a covariance function k(x, x ), which rely on parameters θ. We would like to learn the optimal θ, and estimate the function values y∗ for a set of new inputs X∗. To learn θ, we optimize the marginal likelihood: P(y|X, θ) = N(0, KXX ). We can then use the multivariate normal conditional distribution to evaluate the predictive distribution: P(y∗|X∗, X, y, θ) ∼ N(KX∗XK−1 XXy, KX∗X∗ − KX∗XK−1 XXKXX∗ ). Keyon Vafa Training and Inference for Deep GPs April 26, 2016 9 / 50
  • 26. Gaussian Processes GPs for Regression Note this is all true because we are assuming the outputs correspond to a Gaussian process. We therefore make the following assumption: y y∗ ∼ N 0 0 , KXX KXX∗ KX∗X KX∗X∗ . Keyon Vafa Training and Inference for Deep GPs April 26, 2016 10 / 50
  • 27. Gaussian Processes GPs for Regression Note this is all true because we are assuming the outputs correspond to a Gaussian process. We therefore make the following assumption: y y∗ ∼ N 0 0 , KXX KXX∗ KX∗X KX∗X∗ . Computing P(y|X) and P(y∗|X∗, X, y) only requires matrix algebra on the above assumption. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 10 / 50
  • 28. Gaussian Processes Example of a GP for Regression xf(x) Gaussian Process Regression x Outputs Data Figure: On the left, data from a sigmoidal curve with noise. On the right, samples from a GP trained on the data (represented by ‘x’), using a squared exponential covariance function. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 11 / 50
  • 29. Deep Gaussian Processes Table of Contents 1 Gaussian Processes 2 Deep Gaussian Processes 3 Implementation 4 Experiments and Analysis 5 Conclusion Keyon Vafa Training and Inference for Deep GPs April 26, 2016 12 / 50
  • 30. Deep Gaussian Processes Definition of a Deep Gaussian Process Formally, we define a deep Gaussian Process as the composition of GPs: f(1:L) (x) = f(L) (f(L−1) (. . . f(2) (f(1) (x)) . . . )) where f (l) d ∼ GP 0, k (l) d (x, x ) for f (l) d ∈ f(l) . Keyon Vafa Training and Inference for Deep GPs April 26, 2016 13 / 50
  • 31. Deep Gaussian Processes Deep GP Notation Each layer l consists of D(l) GPs, where D(l) is the number of units at layer l Keyon Vafa Training and Inference for Deep GPs April 26, 2016 14 / 50
  • 32. Deep Gaussian Processes Deep GP Notation Each layer l consists of D(l) GPs, where D(l) is the number of units at layer l For an L layer deep GP, we have Keyon Vafa Training and Inference for Deep GPs April 26, 2016 14 / 50
  • 33. Deep Gaussian Processes Deep GP Notation Each layer l consists of D(l) GPs, where D(l) is the number of units at layer l For an L layer deep GP, we have one input layer xn ∈ RD(0) Keyon Vafa Training and Inference for Deep GPs April 26, 2016 14 / 50
  • 34. Deep Gaussian Processes Deep GP Notation Each layer l consists of D(l) GPs, where D(l) is the number of units at layer l For an L layer deep GP, we have one input layer xn ∈ RD(0) L − 1 hidden layers {hl n}L−1 l=1 Keyon Vafa Training and Inference for Deep GPs April 26, 2016 14 / 50
  • 35. Deep Gaussian Processes Deep GP Notation Each layer l consists of D(l) GPs, where D(l) is the number of units at layer l For an L layer deep GP, we have one input layer xn ∈ RD(0) L − 1 hidden layers {hl n}L−1 l=1 one output layer yn, which we assume to be 1-dimensional. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 14 / 50
  • 36. Deep Gaussian Processes Deep GP Notation Each layer l consists of D(l) GPs, where D(l) is the number of units at layer l For an L layer deep GP, we have one input layer xn ∈ RD(0) L − 1 hidden layers {hl n}L−1 l=1 one output layer yn, which we assume to be 1-dimensional. All layers are completely connected by GPs, each with their own kernel. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 14 / 50
  • 37. Deep Gaussian Processes Example: Two-Layer Deep GP ynhnxn f g Keyon Vafa Training and Inference for Deep GPs April 26, 2016 15 / 50
  • 38. Deep Gaussian Processes Example: Two-Layer Deep GP ynhnxn f g We have a one dimensional input, xn, a one dimensional hidden unit, hn, and a one dimensional output, yn. This two-layer network consists of two GPs, f and g where hn = f (xn), where f ∼ GP(0, k(1) (x, x )) and yn = g(hn), where g ∼ GP(0, k(2) (h, h )). Keyon Vafa Training and Inference for Deep GPs April 26, 2016 15 / 50
  • 39. Deep Gaussian Processes Example: More Complicated Model y h (2) 1 h (2) 2 h (2) 3 h (2) 4 h (1) 1 h (1) 2 h (1) 3 h (1) 4 x1 x2 x3 Graphical representation of a more complicated deep GP architecture. Every edge corresponds to a GP between units, as the outputs of each layer are the inputs of the following layer. Our input data is 3-dimensional, while the two hidden layers in this model each have 4 hidden units. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 16 / 50
  • 40. Deep Gaussian Processes Sampling From a Deep GP 6 4 2 0 2 4 6 x 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 2.5 g(f(x)) Full Deep GP 6 4 2 0 2 4 6 x 2 1 0 1 2 3 4 f(x) Layer 1: Length-scale 0.5 2 1 0 1 2 3 4 f(x) 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 2.5 g(f(x)) Layer 2: Length-scale 1.0 Samples from deep GPs. As opposed to single-layer GPs, a deep GP can model non-stationary functions (functions whose shape changes along the input space) without the use of a non-stationary kernel. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 17 / 50
  • 41. Deep Gaussian Processes Comparison with Neural Networks Similarities: deep architectures, completely connected, single-layer GPs correspond to two-layer neural networks with random weights and infinitely many hidden units. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 18 / 50
  • 42. Deep Gaussian Processes Comparison with Neural Networks Similarities: deep architectures, completely connected, single-layer GPs correspond to two-layer neural networks with random weights and infinitely many hidden units. Differences: deep GP is nonparametric, no activation functions, must specify kernels, training is intractable Keyon Vafa Training and Inference for Deep GPs April 26, 2016 18 / 50
  • 43. Implementation Table of Contents 1 Gaussian Processes 2 Deep Gaussian Processes 3 Implementation 4 Experiments and Analysis 5 Conclusion Keyon Vafa Training and Inference for Deep GPs April 26, 2016 19 / 50
  • 44. Implementation FITC Approximation for Single-Layer GP The Fully Independent Training Conditional Approximation (FITC) circumvents the O(N3) training time for a single-layer GP by introducing pseudo data, points that are not in the data set but can be chosen to approximate the function (Snelson and Ghahramani, 2005). Keyon Vafa Training and Inference for Deep GPs April 26, 2016 20 / 50
  • 45. Implementation FITC Approximation for Single-Layer GP The Fully Independent Training Conditional Approximation (FITC) circumvents the O(N3) training time for a single-layer GP by introducing pseudo data, points that are not in the data set but can be chosen to approximate the function (Snelson and Ghahramani, 2005). We introduce M pseudo inputs ¯X = {¯xm}M m=1 and the corresponding pseudo outputs ¯y = {¯ym}M m=1, which correspond to the function values at the pseudo inputs. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 20 / 50
  • 46. Implementation FITC Approximation for Single-Layer GP The Fully Independent Training Conditional Approximation (FITC) circumvents the O(N3) training time for a single-layer GP by introducing pseudo data, points that are not in the data set but can be chosen to approximate the function (Snelson and Ghahramani, 2005). We introduce M pseudo inputs ¯X = {¯xm}M m=1 and the corresponding pseudo outputs ¯y = {¯ym}M m=1, which correspond to the function values at the pseudo inputs. Key assumption: conditioned on the pseudo data, the output values are independent. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 20 / 50
  • 47. Implementation FITC Approximation for Single-Layer GP We assume a prior P(¯y|¯X) = N (0, K¯X¯X) . Keyon Vafa Training and Inference for Deep GPs April 26, 2016 21 / 50
  • 48. Implementation FITC Approximation for Single-Layer GP We assume a prior P(¯y|¯X) = N (0, K¯X¯X) . Training takes time O(NM2), and testing requires O(M2). Keyon Vafa Training and Inference for Deep GPs April 26, 2016 21 / 50
  • 49. Implementation FITC Example x f(x) 5 Pseudo Parameters x f(x) 10 Pseudo Parameters Figure: The predictive mean of a GP trained on sigmoidal data using the FITC approximation. On the left, we use 5 pseudo data points, while on the right, we use 10. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 22 / 50
  • 50. Implementation Learning Deep GPs is Intractable [Graphical model: $x_n \to h_n \to y_n$, with mappings $f$ (parameters $\theta^{(1)}$) and $g$ (parameters $\theta^{(2)}$).] Example: two-layer model, with inputs X, outputs y, and hidden layer H (which is N × 1). Ideally, a Bayesian treatment would allow us to integrate out the hidden function values to evaluate $P(y \mid X, \theta) = \int P(y \mid H, \theta^{(2)})\, P(H \mid X, \theta^{(1)})\, dH = \int \mathcal{N}(y; 0, K_{HH})\, \mathcal{N}(H; 0, K_{XX})\, dH$. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 23 / 50
  • 51. Implementation Learning Deep GPs is Intractable [Graphical model: $x_n \to h_n \to y_n$, with mappings $f$ (parameters $\theta^{(1)}$) and $g$ (parameters $\theta^{(2)}$).] Example: two-layer model, with inputs X, outputs y, and hidden layer H (which is N × 1). Ideally, a Bayesian treatment would allow us to integrate out the hidden function values to evaluate $P(y \mid X, \theta) = \int P(y \mid H, \theta^{(2)})\, P(H \mid X, \theta^{(1)})\, dH = \int \mathcal{N}(y; 0, K_{HH})\, \mathcal{N}(H; 0, K_{XX})\, dH$. Evaluating this integral of Gaussians is intractable: $K_{HH}$ depends on H through the kernel, so the integrand has no closed-form marginal. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 23 / 50
  • 52. Implementation DGPS Algorithm Overview The Deep Gaussian Process Sampling algorithm relies on two central ideas: Keyon Vafa Training and Inference for Deep GPs April 26, 2016 24 / 50
  • 53. Implementation DGPS Algorithm Overview The Deep Gaussian Process Sampling algorithm relies on two central ideas: We sample predictive means and covariances to approximate the marginal likelihood, relying on automatic differentiation techniques to evaluate the gradients and optimize our objective. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 24 / 50
  • 54. Implementation DGPS Algorithm Overview The Deep Gaussian Process Sampling algorithm relies on two central ideas: We sample predictive means and covariances to approximate the marginal likelihood, relying on automatic differentiation techniques to evaluate the gradients and optimize our objective. We replace every GP with the FITC GP, so the time complexity for L layers and H hidden units per layer is O(N^2MLH) as opposed to O(N^3LH). Keyon Vafa Training and Inference for Deep GPs April 26, 2016 24 / 50
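The automatic-differentiation component can be illustrated in isolation. The sketch below is my own; it assumes the autograd package, which may or may not be what the thesis used. It computes gradients of a single-layer GP log marginal likelihood with respect to a log length-scale without any hand-derived derivatives, and takes plain gradient-ascent steps.

```python
import autograd.numpy as np
from autograd import grad

def log_marginal_likelihood(log_lengthscale, X, y, noise_var=0.01):
    """Single-layer GP log marginal likelihood (additive constant dropped)."""
    ell = np.exp(log_lengthscale)
    K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / ell ** 2)
    K = K + noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(K, y)
    return -0.5 * np.dot(y, alpha) - np.sum(np.log(np.diag(L)))

X = np.linspace(-2, 2, 50)
y = np.sign(X) + 0.1 * np.random.randn(50)       # noisy step data

objective_grad = grad(log_marginal_likelihood)   # gradient via automatic differentiation
log_ell = 0.0
for _ in range(100):                             # simple gradient ascent
    log_ell = log_ell + 0.01 * objective_grad(log_ell, X, y)
```

The idea in the DGPS setting is the same: the sampled objective is written as ordinary array code, and the gradients with respect to all parameters come from automatic differentiation.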
  • 55. Implementation Related Work Damianou and Lawrence (2013) also use the FITC approximation at every layer, but they perform inference with approximate variational marginalization. Subsequent methods (Hensman and Lawrence, 2014; Dai et al., 2015; Bui et al., 2016) also use variational approximations. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 25 / 50
  • 56. Implementation Related Work Damianou and Lawrence (2013) also use the FITC approximation at every layer, but they perform inference with approximate variational marginalization. Subsequent methods (Hensman and Lawrence, 2014; Dai et al., 2015; Bui et al., 2016) also use variational approximations. These methods are able to integrate out the pseudo outputs at each layer, but they rely on integral approximations that restrict the kernel. Meanwhile, the DGPS uses Monte Carlo sampling, which is easier to implement, more intuitive to understand, and can extend easily to most kernels. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 25 / 50
  • 57. Implementation Sampling Hidden Values For inputs X, we calculate the predictive mean and covariance for every unit in the first hidden layer. We then sample values from each predictive distribution Keyon Vafa Training and Inference for Deep GPs April 26, 2016 26 / 50
  • 58. Implementation Sampling Hidden Values For inputs X, we calculate the predictive mean and covariance for every unit in the first hidden layer. We then sample values from each predictive distribution For every hidden layer thereafter, we take the samples from the previous layer, calculate the predictive mean and covariance, and repeat sampling until the final layer. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 26 / 50
  • 59. Implementation Sampling Hidden Values For inputs X, we calculate the predictive mean and covariance for every unit in the first hidden layer. We then sample values from each predictive distribution For every hidden layer thereafter, we take the samples from the previous layer, calculate the predictive mean and covariance, and repeat sampling until the final layer. We use K different samples $\{(\tilde{\mu}_k, \tilde{\Sigma}_k)\}_{k=1}^{K}$ to approximate the marginal likelihood: $P(y \mid X) \approx \frac{1}{K}\sum_{k=1}^{K} P(y \mid \tilde{\mu}_k, \tilde{\Sigma}_k) = \frac{1}{K}\sum_{k=1}^{K} \mathcal{N}(y; \tilde{\mu}_k, \tilde{\Sigma}_k)$ Keyon Vafa Training and Inference for Deep GPs April 26, 2016 26 / 50
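A sketch of this procedure, in my own notation: each layer is represented here by a function that returns a predictive mean and a diagonal variance, samples are propagated layer by layer, and the K Gaussian likelihoods of y are averaged on the log scale. The `layers` interface is an assumption made for illustration.

```python
import numpy as np
from scipy.stats import norm

def mc_log_marginal_likelihood(X, y, layers, n_samples=20):
    """`layers` is a list of functions mapping inputs to (mean, variance) arrays."""
    log_liks = []
    for _ in range(n_samples):
        values = X
        for predict in layers[:-1]:
            mean, var = predict(values)                       # predictive distribution
            values = mean + np.sqrt(var) * np.random.randn(*mean.shape)
        mean, var = layers[-1](values)                        # final-layer prediction
        log_liks.append(np.sum(norm.logpdf(y, mean, np.sqrt(var))))
    # log of the average likelihood over the K samples, computed stably
    return -np.log(n_samples) + np.logaddexp.reduce(log_liks)
```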
  • 60. Implementation FITC for Deep GPs To make fitting more scalable, we replace every GP in the model with a FITC GP Keyon Vafa Training and Inference for Deep GPs April 26, 2016 27 / 50
  • 61. Implementation FITC for Deep GPs To make fitting more scalable, we replace every GP in the model with a FITC GP For each GP, corresponding to hidden unit d in layer l, we introduce pseudo inputs $\bar{X}^{(l)}_d$ and corresponding pseudo outputs $\bar{y}^{(l)}_d$. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 27 / 50
  • 62. Implementation FITC for Deep GPs To make fitting more scalable, we replace every GP in the model with a FITC GP For each GP, corresponding to hidden unit d in layer l, we introduce pseudo inputs $\bar{X}^{(l)}_d$ and corresponding pseudo outputs $\bar{y}^{(l)}_d$. With the addition of the pseudo data, we are required to learn the following set of parameters: $\Theta = \left\{ \left\{ \bar{X}^{(l)}_d, \bar{y}^{(l)}_d, \theta^{(l)}_d \right\}_{d=1}^{D^{(l)}} \right\}_{l=1}^{L}$. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 27 / 50
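One possible (hypothetical) way to organize Θ in code is a nested structure with one entry per hidden unit per layer; the function and field names below are illustrative only and not taken from the thesis.

```python
import numpy as np

def init_params(layer_dims, n_pseudo, input_dim, seed=0):
    """Nested parameter container: params[l][d] holds unit d of layer l."""
    rng = np.random.RandomState(seed)
    params, d_in = [], input_dim
    for d_out in layer_dims:                                  # one entry per layer
        layer = [{
            "pseudo_inputs": rng.randn(n_pseudo, d_in),       # Xbar_d^(l), M x d_in
            "pseudo_outputs": rng.randn(n_pseudo),            # ybar_d^(l), length M
            "log_lengthscale": np.zeros(d_in),                # theta_d^(l)
        } for _ in range(d_out)]                              # one dict per hidden unit
        params.append(layer)
        d_in = d_out
    return params

params = init_params(layer_dims=[1, 1], n_pseudo=10, input_dim=1)  # two layers, one unit each
```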
  • 63. Implementation Example: DGPS Algorithm on 2 Layers [Graphical model: $X_n \to H_n \to y_n$, with $f$ parameterized by $(\bar{X}^{(1)}, \bar{y}^{(1)}, \theta^{(1)})$ and $g$ by $(\bar{X}^{(2)}, \bar{y}^{(2)}, \theta^{(2)})$.] Our goal is to learn Keyon Vafa Training and Inference for Deep GPs April 26, 2016 28 / 50
  • 64. Implementation Example: DGPS Algorithm on 2 Layers [Graphical model: $X_n \to H_n \to y_n$ via $f$ and $g$.] Our goal is to learn $\{(\bar{X}^{(l)}, \bar{y}^{(l)})\}_{l=1}^{2}$, the pseudo data for each layer Keyon Vafa Training and Inference for Deep GPs April 26, 2016 28 / 50
  • 65. Implementation Example: DGPS Algorithm on 2 Layers [Graphical model: $X_n \to H_n \to y_n$ via $f$ and $g$.] Our goal is to learn $\{(\bar{X}^{(l)}, \bar{y}^{(l)})\}_{l=1}^{2}$, the pseudo data for each layer $\theta^{(1)}$ and $\theta^{(2)}$, the kernel parameters for f and g Keyon Vafa Training and Inference for Deep GPs April 26, 2016 28 / 50
  • 66. Implementation Example: DGPS Algorithm on 2 Layers [Graphical model: $X_n \to H_n \to y_n$ via $f$ and $g$.] To sample values H from the hidden layer, we use the FITC approximation and assume $P(H \mid X, \bar{X}^{(1)}, \bar{y}^{(1)}) = \mathcal{N}(\mu^{(1)}, \Sigma^{(1)})$ Keyon Vafa Training and Inference for Deep GPs April 26, 2016 29 / 50
  • 67. Implementation Example: DGPS Algorithm on 2 Layers [Graphical model: $X_n \to H_n \to y_n$ via $f$ and $g$.] To sample values H from the hidden layer, we use the FITC approximation and assume $P(H \mid X, \bar{X}^{(1)}, \bar{y}^{(1)}) = \mathcal{N}(\mu^{(1)}, \Sigma^{(1)})$ where $\mu^{(1)} = K_{X\bar{X}^{(1)}} K_{\bar{X}^{(1)}\bar{X}^{(1)}}^{-1} \bar{y}^{(1)}$ and $\Sigma^{(1)} = \operatorname{diag}\!\left( K_{XX} - K_{X\bar{X}^{(1)}} K_{\bar{X}^{(1)}\bar{X}^{(1)}}^{-1} K_{\bar{X}^{(1)}X} \right)$. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 29 / 50
  • 68. Implementation Example: DGPS Algorithm on 2 Layers [Graphical model: $X_n \to H_n \to y_n$ via $f$ and $g$.] To sample values H from the hidden layer, we use the FITC approximation and assume $P(H \mid X, \bar{X}^{(1)}, \bar{y}^{(1)}) = \mathcal{N}(\mu^{(1)}, \Sigma^{(1)})$ where $\mu^{(1)} = K_{X\bar{X}^{(1)}} K_{\bar{X}^{(1)}\bar{X}^{(1)}}^{-1} \bar{y}^{(1)}$ and $\Sigma^{(1)} = \operatorname{diag}\!\left( K_{XX} - K_{X\bar{X}^{(1)}} K_{\bar{X}^{(1)}\bar{X}^{(1)}}^{-1} K_{\bar{X}^{(1)}X} \right)$. We obtain K samples $\{\tilde{H}_k\}_{k=1}^{K}$ from the above distribution. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 29 / 50
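A sketch of this sampling step in my own notation: given $\mu^{(1)}$ and the diagonal of $\Sigma^{(1)}$, the K hidden-value vectors can be drawn via the reparameterization H = mu + sqrt(var) * eps with standard normal eps, one natural choice because it keeps the samples differentiable with respect to the parameters.

```python
import numpy as np

def sample_hidden(mu1, sigma1_diag, n_samples, rng=np.random):
    """Draw K reparameterized samples of the hidden values (shape K x N)."""
    eps = rng.randn(n_samples, len(mu1))                        # K x N standard normal draws
    return mu1[None, :] + np.sqrt(sigma1_diag)[None, :] * eps   # K samples of H
```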
  • 69. Implementation Example: DGPS Algorithm on 2 Layers [Graphical model: $X_n \to H_n \to y_n$ via $f$ and $g$.] For each sample $\tilde{H}_k$, we can approximate $P(y \mid \tilde{H}_k, \bar{X}^{(2)}, \bar{y}^{(2)}) \approx \mathcal{N}(\tilde{\mu}^{(2)}, \tilde{\Sigma}^{(2)})$ Keyon Vafa Training and Inference for Deep GPs April 26, 2016 30 / 50
  • 70. Implementation Example: DGPS Algorithm on 2 Layers [Graphical model: $X_n \to H_n \to y_n$ via $f$ and $g$.] For each sample $\tilde{H}_k$, we can approximate $P(y \mid \tilde{H}_k, \bar{X}^{(2)}, \bar{y}^{(2)}) \approx \mathcal{N}(\tilde{\mu}^{(2)}, \tilde{\Sigma}^{(2)})$ where $\tilde{\mu}^{(2)} = K_{\tilde{H}_k\bar{X}^{(2)}} K_{\bar{X}^{(2)}\bar{X}^{(2)}}^{-1} \bar{y}^{(2)}$ and $\tilde{\Sigma}^{(2)} = \operatorname{diag}\!\left( K_{\tilde{H}_k\tilde{H}_k} - K_{\tilde{H}_k\bar{X}^{(2)}} K_{\bar{X}^{(2)}\bar{X}^{(2)}}^{-1} K_{\bar{X}^{(2)}\tilde{H}_k} \right)$. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 30 / 50
  • 71. Implementation Example: DGPS Algorithm on 2 Layers Thus, we can approximate the marginal likelihood with our samples: $P(y \mid X, \Theta) \approx \frac{1}{K} \sum_{k=1}^{K} P(y \mid \tilde{H}_k, \bar{X}^{(2)}, \bar{y}^{(2)})$. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 31 / 50
  • 72. Implementation Example: DGPS Algorithm on 2 Layers Thus, we can approximate the marginal likelihood with our samples: $P(y \mid X, \Theta) \approx \frac{1}{K} \sum_{k=1}^{K} P(y \mid \tilde{H}_k, \bar{X}^{(2)}, \bar{y}^{(2)})$. Incorporating the prior over the pseudo outputs into our objective, we have: $\mathcal{L}(y \mid X, \Theta) = \log P(y \mid X, \Theta) + \sum_{l=1}^{L} \sum_{d=1}^{D^{(l)}} \log P\!\left( \bar{y}^{(l)}_d \mid \bar{X}^{(l)}_d \right)$. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 31 / 50
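The extra prior terms can be sketched as follows (my own notation, reusing the nested parameter structure from the earlier sketch; the kernel function is supplied by the caller): for each layer l and hidden unit d we add $\log \mathcal{N}(\bar{y}^{(l)}_d; 0, K_{\bar{X}^{(l)}_d\bar{X}^{(l)}_d})$ to the sampled log marginal likelihood.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def log_pseudo_output_prior(params, kernel):
    """Sum of log N(ybar_d^(l); 0, K_XbarXbar) over all layers and hidden units."""
    total = 0.0
    for layer in params:
        for unit in layer:
            Kmm = kernel(unit["pseudo_inputs"], unit["pseudo_inputs"])
            Kmm = Kmm + 1e-6 * np.eye(len(Kmm))          # jitter for numerical stability
            total += mvn.logpdf(unit["pseudo_outputs"], np.zeros(len(Kmm)), Kmm)
    return total

# Full objective ~ sampled log marginal likelihood + log_pseudo_output_prior(params, kernel).
```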
  • 73. Experiments and Analysis Table of Contents 1 Gaussian Processes 2 Deep Gaussian Processes 3 Implementation 4 Experiments and Analysis 5 Conclusion Keyon Vafa Training and Inference for Deep GPs April 26, 2016 32 / 50
  • 74. Experiments and Analysis Step Function We test on a step function with noise: $X \in [-2, 2]$, $y_i = \operatorname{sign}(x_i) + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, 0.01)$. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 33 / 50
  • 75. Experiments and Analysis Step Function We test on a step function with noise: $X \in [-2, 2]$, $y_i = \operatorname{sign}(x_i) + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, 0.01)$. The non-stationarity of a step function is appealing from a deep GP perspective. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 33 / 50
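Generating this data set is straightforward; the sketch below assumes uniformly drawn inputs (an assumption on my part) and uses the 80/20 train/test split mentioned later in the experiments.

```python
import numpy as np

rng = np.random.RandomState(0)
N = 100
x = rng.uniform(-2, 2, size=N)                    # inputs on [-2, 2]
y = np.sign(x) + np.sqrt(0.01) * rng.randn(N)     # y_i = sign(x_i) + eps_i, eps_i ~ N(0, 0.01)

# 80/20 train/test split, as in the experiments
n_train = int(0.8 * N)
x_train, x_test = x[:n_train], x[n_train:]
y_train, y_test = y[:n_train], y[n_train:]
```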
  • 76. Experiments and Analysis Step Function Samples from a Single-Layer GP Figure: Functions sampled from a single-layer GP. Evidently, the predictive draws do not fully capture the shape of the step function. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 34 / 50
  • 77. Experiments and Analysis Step Function Figure: Predictive draws from a single-layer GP and a two-layer deep GP. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 35 / 50
  • 78. Experiments and Analysis Step Function Figure: Predictive draws from a three-layer deep GP. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 36 / 50
  • 79. Experiments and Analysis Step Function [Figure: Impact of Parameter Initializations on Predictive Draws; panels compare a random initialization and a smart initialization, each showing the predictive draws f(x) and the learned hidden values.] Keyon Vafa Training and Inference for Deep GPs April 26, 2016 37 / 50
  • 80. Experiments and Analysis Step Function [Figure panels: test mean squared error and test log-likelihood per data point vs. number of layers (1, 2, 3), for 50, 100, and 200 data points.] Figure: Experimental results measuring the test log-likelihood per data point and test mean squared error on the noisy step function. We vary the number of layers used in the model, along with the number of data points used in the original step function (which is divided 80/20 into train/test). We run 10 trials at each combination. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 38 / 50
  • 81. Experiments and Analysis Step Function Occasionally, models with deeper architectures outperform shallower ones, yet they also show the widest spread across trials and the single worst results. Figure: Test-set log-likelihoods per data point and mean squared errors plotted against their training-set counterparts for the step function experiment. Overfitting does not appear to be a problem. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 39 / 50
  • 82. Experiments and Analysis Step Function Overfitting does not appear to be a problem. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 40 / 50
  • 83. Experiments and Analysis Step Function Overfitting does not appear to be a problem. If we can successfully optimize our objective, deeper architectures are better suited to learning the noisy step function than shallower ones. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 40 / 50
  • 84. Experiments and Analysis Step Function Overfitting does not appear to be a problem. If we can successfully optimize our objective, deeper architectures are better suited to learning the noisy step function than shallower ones. However, training and optimization become more difficult as the number of layers grows and the number of parameters increases. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 40 / 50
  • 85. Experiments and Analysis Step Function [Figure panels: Random Seed 66 and Random Seed 0, each showing the predictive draws f(x) together with the two layers of learned hidden values.] Figure: Predictive draws from two identical three-layer models, albeit with different random parameter initializations. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 41 / 50
  • 86. Experiments and Analysis Step Function Ways to combat optimization challenges: Keyon Vafa Training and Inference for Deep GPs April 26, 2016 42 / 50
  • 87. Experiments and Analysis Step Function Ways to combat optimization challenges: Using random restarts Keyon Vafa Training and Inference for Deep GPs April 26, 2016 42 / 50
  • 88. Experiments and Analysis Step Function Ways to combat optimization challenges: Using random restarts Decreasing the number of model parameters Keyon Vafa Training and Inference for Deep GPs April 26, 2016 42 / 50
  • 89. Experiments and Analysis Step Function Ways to combat optimization challenges: Using random restarts Decreasing the number of model parameters Trying different optimization methods Keyon Vafa Training and Inference for Deep GPs April 26, 2016 42 / 50
  • 90. Experiments and Analysis Step Function Ways to combat optimization challenges: Using random restarts (see the sketch after this list) Decreasing the number of model parameters Trying different optimization methods Experimenting with more diverse architectures, e.g. increasing the dimension of the hidden layer Keyon Vafa Training and Inference for Deep GPs April 26, 2016 42 / 50
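A minimal sketch of the random-restart strategy referenced above; `train_fn` and `objective_fn` are hypothetical placeholders for training a deep GP from a given random state and for scoring the learned parameters.

```python
import numpy as np

def train_with_restarts(train_fn, objective_fn, n_restarts=5, seed=0):
    """Train from several random initializations and keep the best result."""
    best_params, best_obj = None, -np.inf
    for r in range(n_restarts):
        params = train_fn(np.random.RandomState(seed + r))   # fresh initialization
        obj = objective_fn(params)                           # e.g. training objective
        if obj > best_obj:
            best_params, best_obj = params, obj
    return best_params
```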
  • 91. Experiments and Analysis Toy Non-Stationary Data We create toy non-stationary data to evaluate a deep GP’s ability to learn a non-stationary function. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 43 / 50
  • 92. Experiments and Analysis Toy Non-Stationary Data We create toy non-stationary data to evaluate a deep GP’s ability to learn a non-stationary function. We divide the input space into three regions: X1 ∈ [−4, −3], X2 ∈ [−1, 1] and X3 ∈ [2, 4], each of which consists of 40 data points. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 43 / 50
  • 93. Experiments and Analysis Toy Non-Stationary Data We create toy non-stationary data to evaluate a deep GP’s ability to learn a non-stationary function. We divide the input space into three regions: X1 ∈ [−4, −3], X2 ∈ [−1, 1] and X3 ∈ [2, 4], each of which consists of 40 data points. We sample from a GP with length-scale ℓ = 0.25 for regions X1 and X3, and with ℓ = 2 for region X2. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 43 / 50
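A sketch of this data-generation procedure; the uniform placement of the 40 inputs per region and the random seed are my assumptions.

```python
import numpy as np

def se_kernel(a, b, ell):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def gp_sample(x, ell, rng, jitter=1e-6):
    """One draw from a zero-mean GP prior with length-scale ell at inputs x."""
    K = se_kernel(x, x, ell) + jitter * np.eye(len(x))
    return np.linalg.cholesky(K) @ rng.randn(len(x))

rng = np.random.RandomState(0)
regions = [(-4, -3, 0.25), (-1, 1, 2.0), (2, 4, 0.25)]   # (low, high, length-scale)
xs, ys = [], []
for lo, hi, ell in regions:
    xr = np.sort(rng.uniform(lo, hi, 40))                # 40 points per region
    xs.append(xr)
    ys.append(gp_sample(xr, ell, rng))
x, y = np.concatenate(xs), np.concatenate(ys)
```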
  • 94. Experiments and Analysis Toy Non-Stationary Data [Figure: Predictive Draws for Toy Non-Stationary Data; panels show the data, the 1-layer GP, and the 2-layer deep GP together with its hidden values.] Figure: Predictive draws from the single-layer and 2-layer models for toy non-stationary data with squared exponential kernels. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 44 / 50
  • 95. Experiments and Analysis Toy Non-Stationary Data Non-Stationary Data: 3-Layer Deep GP Figure: The optimization for a 3-layer model can get stuck in a local optimum, and although the predictive draws are non-stationary, our predictions are poor at the tails. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 45 / 50
  • 96. Experiments and Analysis Motorcycle Data 94 data points, where the inputs are times in milliseconds since the impact of a motorcycle accident and the outputs are the corresponding helmet accelerations. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 46 / 50
  • 97. Experiments and Analysis Motorcycle Data 94 data points, where the inputs are times in milliseconds since the impact of a motorcycle accident and the outputs are the corresponding helmet accelerations. The dataset is somewhat non-stationary: the accelerations are roughly constant early on but become more variable after a certain time. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 46 / 50
  • 98. Experiments and Analysis Motorcycle Data [Figure: Predictive Draws for Motorcycle Data; panels show the data, the 1-layer GP, and the 2-layer deep GP together with its hidden values (time vs. acceleration).] Figure: Predictive draws from the single-layer and 2-layer models trained on motorcycle data with squared exponential kernels. Keyon Vafa Training and Inference for Deep GPs April 26, 2016 47 / 50
  • 99. Conclusion Table of Contents 1 Gaussian Processes 2 Deep Gaussian Processes 3 Implementation 4 Experiments and Analysis 5 Conclusion Keyon Vafa Training and Inference for Deep GPs April 26, 2016 48 / 50
  • 100. Conclusion Future Directions Natural extensions include Keyon Vafa Training and Inference for Deep GPs April 26, 2016 49 / 50
  • 101. Conclusion Future Directions Natural extensions include Trying different optimization methods to avoid getting stuck in local optima Keyon Vafa Training and Inference for Deep GPs April 26, 2016 49 / 50
  • 102. Conclusion Future Directions Natural extensions include Trying different optimization methods to avoid getting stuck in local optima Introducing variational parameters so we do not have to learn pseudo outputs Keyon Vafa Training and Inference for Deep GPs April 26, 2016 49 / 50
  • 103. Conclusion Future Directions Natural extensions include Trying different optimization methods to avoid getting stuck in local optima Introducing variational parameters so we do not have to learn pseudo outputs Extending model to classification Keyon Vafa Training and Inference for Deep GPs April 26, 2016 49 / 50
  • 104. Conclusion Future Directions Natural extensions include Trying different optimization methods to avoid getting stuck in local optima Introducing variational parameters so we do not have to learn pseudo outputs Extending model to classification Exploring properties of more complex architectures and evaluating the model likelihood to choose the optimal configuration Keyon Vafa Training and Inference for Deep GPs April 26, 2016 49 / 50
  • 105. Conclusion Acknowledgments A huge thank-you to Sasha Rush, Finale Doshi-Velez, David Duvenaud, and Miguel Hernández-Lobato. This thesis would not have been possible without your help and support! Keyon Vafa Training and Inference for Deep GPs April 26, 2016 50 / 50